java | Genuine Blog

Fast writes are generally a characteristic strength of distributed NoSQL databases such as HBase and Cassandra. Yet, for a distributed application that needs to capture rapid streams of data in a database, the standard connection pooling provided by the database might not be up to the task. For instance, I didn’t get the performance I wanted when using HBase’s HTablePool to accommodate real-time streaming of data from a high-parallelism, data-dumping Storm bolt.

To dump rapid real-time streaming data into HBase, it might be more efficient to embed a queueing mechanism in the HBase storage module instead of relying on HTablePool. An ex-colleague of mine, who is the architect at a VoIP service provider, employs this very mechanism in their production HBase database. Below is a simple implementation that has been tested and performs well with a good-sized Storm topology. The code is rather self-explanatory. The HBaseStreamers class contains a threaded inner class, Streamer, which maintains a queue of HBase Put objects using LinkedBlockingQueue. Key parameters are in the HBaseStreamers constructor argument list, including the ZooKeeper quorum and client port, HBase table name, HTable auto-flush switch, number of streaming queues and streaming queue capacity.

package hbstream;

import java.util.UUID;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseStreamers {
    private Configuration hbaseConfig;
    private Streamer[] streamers;
    private boolean started = false;

    private class Streamer implements Runnable {
        private LinkedBlockingQueue<Put> queue;
        private HTable table;
        private String tableName;
        // Written only by this streamer's thread; volatile so that counter() sees fresh values
        private volatile int counter = 0;

        public Streamer(String tableName, boolean autoFlush, int capacity) throws Exception {
            table = new HTable(hbaseConfig, tableName);
            table.setAutoFlush(autoFlush);
            this.tableName = tableName;
            queue = new LinkedBlockingQueue<Put>(capacity);
        }

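        public void run() {
            while (true) {
                try {
                    // Block until a Put is available, then write it to HBase
                    Put put = queue.take();
                    table.put(put);
                    counter++;
                }
                catch (InterruptedException e) {
                    // Restore the interrupt flag and stop this streamer thread
                    Thread.currentThread().interrupt();
                    return;
                }
                catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }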

        public void write(Put put) throws Exception {
            queue.put(put);
        }

        public void flush() {
            if (!table.isAutoFlush()) {
                try {
                    table.flushCommits();
                }
                catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }

        public int size() {
            return queue.size();
        }

        public int counter() {
            return counter;
        }
    }

    public HBaseStreamers(String quorum, String port, String tableName, boolean autoFlush, int numOfStreamers, int capacity) throws Exception {
        hbaseConfig = HBaseConfiguration.create();
        hbaseConfig.set("hbase.zookeeper.quorum", quorum);
        hbaseConfig.set("hbase.zookeeper.property.clientPort", port);
        streamers = new Streamer[numOfStreamers];
        for (int i = 0; i < streamers.length; i++) {
            streamers[i] = new Streamer(tableName, autoFlush, capacity);
        }
    }

    public Runnable[] getStreamers() {
        return streamers;
    }

    public synchronized void start() {
        if (started) {
            return;
        }
        started = true;
        int count = 1;
        for (Streamer streamer : streamers) {
            new Thread(streamer, streamer.tableName + " HBStreamer " + count).start();
            count++;
        }
    }

    public void write(Put put) throws Exception {
        int i = (int) (System.currentTimeMillis() % streamers.length);
        streamers[i].write(put);
    }

    public void flush() {
        for (Streamer streamer : streamers) {
            streamer.flush();
        }
    }

    public int size() {
        int size = 0;
        for (Streamer st : streamers) {
            size += st.size();
        }
        return size;
    }

    public int counter() {
        int counter = 0;
        for (Streamer st : streamers) {
            counter += st.counter();
        }
        return counter;
    }
}
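
One detail in write() above may be worth revisiting: choosing a queue by System.currentTimeMillis() % streamers.length sends every Put that arrives within the same millisecond to the same queue, which can leave the load uneven during bursts. Below is a minimal sketch of an alternative, assuming a plain round-robin counter is acceptable. The RoundRobinSelector class is hypothetical and not part of the original module.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper (not part of the original module): hands out queue
// indexes in round-robin order and is safe to call from multiple threads.
class RoundRobinSelector {
    private final int size;
    private final AtomicInteger next = new AtomicInteger(0);

    RoundRobinSelector(int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        this.size = size;
    }

    // Cycles through 0, 1, ..., size-1. floorMod keeps the result
    // non-negative even if the counter wraps around to a negative value.
    int nextIndex() {
        return Math.floorMod(next.getAndIncrement(), size);
    }
}
```

With such a helper created in the HBaseStreamers constructor, the index computation in write() could become something like int i = selector.nextIndex(), where selector is an assumed field name.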

Next, write a wrapper class similar to the following to isolate HBase specifics from the streaming application.

package hbstream;
....

public class StreamToHBase {
    private static final String tableName = "stormhbtest";
    private static final byte[] colFamily = Bytes.toBytes("data");
    private static final byte[] colQualifier = Bytes.toBytes("message");
    private static boolean isInit = false;
    private static HBaseStreamers hbStreamers = null;
    ....

    public static synchronized void init(String zkQuorum, String zkPort, boolean autoFlush, int numOfStreamers, int queueCapacity)
        throws Exception {

        if (isInit)
            return;
        HBaseStreamers streamers = new HBaseStreamers(zkQuorum, zkPort, tableName, autoFlush, numOfStreamers, queueCapacity);
        streamers.start();
        hbStreamers = streamers;
        // Set the flag last, so that a failed attempt can be retried
        isInit = true;
        ....
    }

    public static void writeMessage(String message) throws Exception {
        byte[] value = Bytes.toBytes(message);
        byte[] rowIdBytes = Bytes.toBytes(UUID.randomUUID().toString());
        Put p = new Put(rowIdBytes);
        p.add(colFamily, colQualifier, value);
        if (hbStreamers != null) {
            hbStreamers.write(p);
        }
    }
    ....
}

To test it with a distributed streaming application using Storm, write a bolt similar to the following skeleton. All that is needed is to initialize HBaseStreamers (through StreamToHBase.init()) from within the bolt’s prepare() method and dump data to HBase from within the bolt’s execute() method.

package hbstream;
....

public class HBStreamTestBolt extends BaseRichBolt {
    OutputCollector _collector;
    ....

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        ....
        try {
            StreamToHBase.init("172.16.47.101, 172.16.47.102, 172.16.47.103", "2181", false, 4, 1000);
        }
        catch (Exception e) {
            ....
        }
        ....
    }

    @Override
    public void execute(Tuple tuple) {
        ....
        try {
            StreamToHBase.writeMessage(message);
        }
        catch (Exception e) {
            ....
        }
        ....
    }
    ....
}

Finally, write a Storm spout to serve as the streaming data source and a Storm topology builder to put the spout and bolt together.

package hbstream;
....

public class HBStreamTestSpout extends BaseRichSpout {
    SpoutOutputCollector _collector;
    ....

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        _collector = collector;
        words = new ArrayList<String>();
        ....
    }

    @Override
    public void nextTuple() {
        int rand = (int) (Math.random() * 1000);
        String word = words.get(rand % words.size());
        _collector.emit(new Values(word));
        ....
    }
    ....
}

package hbstream;
....

public class HBStreamTopology {
    ....

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("testSpout", new HBStreamTestSpout(), 4);
        builder.setBolt("testBolt", new HBStreamTestBolt(), 6)
               .shuffleGrouping("testSpout");

        Config conf = new Config();
        conf.setNumWorkers(2);
        StormSubmitter.submitTopology("HBStreamTopology", conf, builder.createTopology());
        ....
    }
}

The parallelism/queue parameters are set to relatively small numbers in the above sample code. Once the setup is tested and working, one can tweak the various dials in accordance with the server cluster capacity. These dials include the following:

StreamToHBase.init():
    - boolean autoFlush
    - int numOfStreamers
    - int queueCapacity

TopologyBuilder.setSpout():
    - Number parallelismHint

TopologyBuilder.setBolt():
    - Number parallelismHint

Config.setNumWorkers():
    - int workers

For simplicity, only HBase Put is handled in the above implementation. It can certainly be expanded to also handle HBase Increment so as to carry out aggregation functions such as count. The primary goal of this Storm-to-HBase streaming exercise is to showcase the use of a module equipped with some “elasticity” by means of configurable queues. The queueing mechanism within HBaseStreamers provides cushioned funnels for the data streams and helps optimize the overall data intake bandwidth. Keep in mind, though, that this doesn’t remove the need for administration work to keep the HBase-Hadoop system properly configured.
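
As one illustration of how the queueing idea might be taken further, a streamer thread could pull several queued items at once and write them in a single call instead of one at a time. The following is a minimal sketch of the batching part only, using plain Java collections with no HBase dependency. The BatchDrainer class and its capacity and maxBatch parameters are hypothetical and not part of the implementation above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: a bounded queue whose consumer takes items in batches.
class BatchDrainer<T> {
    private final BlockingQueue<T> queue;
    private final int maxBatch;

    BatchDrainer(int capacity, int maxBatch) {
        if (maxBatch < 1) {
            throw new IllegalArgumentException("maxBatch must be at least 1");
        }
        this.queue = new LinkedBlockingQueue<T>(capacity);
        this.maxBatch = maxBatch;
    }

    // Blocks while the queue is full, which throttles fast producers.
    void write(T item) throws InterruptedException {
        queue.put(item);
    }

    // Waits for at least one item, then also takes whatever is already
    // queued, up to maxBatch items in total.
    List<T> nextBatch() throws InterruptedException {
        List<T> batch = new ArrayList<T>(maxBatch);
        batch.add(queue.take());
        queue.drainTo(batch, maxBatch - 1);
        return batch;
    }
}
```

In the Streamer class, the batch would presumably be a List of Put objects handed to the table in one put call. Whether this helps in practice depends on the auto-flush setting and the write buffer size, so it is something to measure rather than assume.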

package sorting;

public class Sorter<E extends Comparable<E>> {

    public void performSorting(E[] list, SimpleSort<E> sortingAlgo) {
        long startTime, elapsedTime;

        System.out.println("\nOriginal " + list.getClass().getSimpleName() + " list: ...");
        for (int i = 0; i < list.length; i++) {
            System.out.print(list[i] + " ");
        }
        System.out.println();

        startTime = System.currentTimeMillis();
        sortingAlgo.sort(list);
        elapsedTime = System.currentTimeMillis() - startTime;

        System.out.println("\nResult: " + sortingAlgo.getClass().getSimpleName() + " ...");
        for (int i = 0; i < list.length; i++) {
            System.out.print(list[i] + " ");
        }
        System.out.println("\n\nTime taken: " + elapsedTime + "ms");
    }
}

package sorting;

public class MergeSort<E extends Comparable<E>> implements SimpleSort<E> {
    private E[] list;
    private E[] holder;

    @SuppressWarnings({"unchecked"})
    public void sort(E[] list) {
        this.list = list;
        int size = list.length;
        // Type erasure: first bound class of <E extends Comparable<E>>
        holder = (E[]) new Comparable[size];
        mergeSort(0, size - 1);
    }

    private void mergeSort(int left, int right) {
        if (left < right) {
            int center = left + (right - left) / 2;
            mergeSort(left, center);
            mergeSort(center + 1, right);
            merge(left, center, right);
        }
    }

    private void merge(int left, int center, int right) {
        for (int i = left; i <= right; i++) {
            holder[i] = list[i];
        }

        int i = left;
        int j = center + 1;
        int k = left;

        while (i <= center && j <= right) {
            if (holder[i].compareTo(holder[j]) <= 0) {
                list[k] = holder[i];  // Overwrite list[k] with element from the left list
                i++;
            } else {
                list[k] = holder[j];  // Overwrite list[k] with element from the right list
                j++;
            }
            k++;
        }

        // Overwrite remaining list[k] if the left list isn't exhausted
        // No need to do the same for the right list, as its elements are already correctly placed
        while (i <= center) {
            list[k] = holder[i];
            k++;
            i++;
        }
    }
}

package sorting;

import java.util.Random;

public class QuickSort<E extends Comparable<E>> implements SimpleSort<E> {
    private E[] list;

    @SuppressWarnings({"unchecked"})
    public void sort(E[] list) {
        this.list = list;
        int size = list.length;
        quickSort(list, 0, size-1);
    }

    private void swapListElements(E[] list, int index1, int index2) {
        E tempValue = list[index1];
        list[index1] = list[index2];
        list[index2] = tempValue;
    }

    private int partitionIndex(E[] list, int left, int right, int pivot) {
        int pivotIndex;
        E pivotValue;

        pivotValue = list[pivot];
        swapListElements(list, pivot, right);  // Swap pivot element to rightmost index
        pivotIndex = left;  // Calculate pivot index starting from leftmost index

        for (int i = left; i < right; i++) {
            if (list[i].compareTo(pivotValue) < 0) {
                swapListElements(list, i, pivotIndex);
                pivotIndex++;
            }
        }
        swapListElements(list, pivotIndex, right);  // Swap pivot element back to calculated pivot index
        return pivotIndex;
    }

    private void quickSort(E[] list, int left, int right) {
        int randomIndex, pivotIndex;

        if (left < right) {
            // Pick random index between left and right (inclusive)
            Random randomNum = new Random();
            randomIndex = randomNum.nextInt(right-left+1) + left;

            pivotIndex = partitionIndex(list, left, right, randomIndex);
            quickSort(list, left, pivotIndex - 1);
            quickSort(list, pivotIndex + 1, right);
        }
        else
            return;
    }
}

package sorting;

public class SortingMain {

    @SuppressWarnings({"unchecked"})
    public static void main(String[] args) {
        Integer[] integerList = {8, 15, 6, 2, 11, 4, 7, 12, 3};
        Double[] doubleList = {2.5, 8.0, 7.2, 4.9, 6.6, 1.8, 3.0, 7.5};
        Character[] characterList = {'M', 'X', 'B', 'C', 'T', 'R', 'F', 'S'};
        String[] stringList = {"lion", "tiger", "snow leopard", "jaguar", "cheeta", "cougar", "leopard"};

        Sorter sorter = new Sorter();
        sorter.performSorting(integerList, new MergeSort());
        sorter.performSorting(doubleList, new QuickSort());
        sorter.performSorting(characterList, new MergeSort());
        sorter.performSorting(stringList, new QuickSort());
    }
}
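
The sorting classes above implement a SimpleSort interface that is not shown in this listing. Below is a minimal sketch of what it presumably looks like, inferred only from how MergeSort and QuickSort use it, so the exact original declaration is an assumption. The InsertionSort class is a hypothetical extra implementation, included purely to show that anything implementing the interface can be plugged into Sorter.performSorting().

```java
// Assumed shape of the interface, inferred from its implementations.
// The original declaration is not shown and would sit in package sorting.
interface SimpleSort<E extends Comparable<E>> {
    void sort(E[] list);
}

// Hypothetical extra implementation, for illustration only.
class InsertionSort<E extends Comparable<E>> implements SimpleSort<E> {
    public void sort(E[] list) {
        for (int i = 1; i < list.length; i++) {
            E current = list[i];
            int j = i - 1;
            // Shift larger elements one slot to the right
            while (j >= 0 && list[j].compareTo(current) > 0) {
                list[j + 1] = list[j];
                j--;
            }
            // Drop the current element into the gap that was opened up
            list[j + 1] = current;
        }
    }
}
```

Because the algorithm is passed in through the interface, Sorter only needs to time the sort() call and print the list, without knowing which algorithm it is running.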

Genuine Blog

A Tech Blog by Leo Cheung

Streaming Real-time Data Into HBase

Programming Exercise – Sorting Algorithm