Tag Archives: java

Streaming Real-time Data Into HBase

Fast-write is generally a characteristic strength of distributed NoSQL databases such as HBase, Cassandra. Yet, for a distributed application that needs to capture rapid streams of data in a database, standard connection pooling provided by the database might not be up to the task. For instance, I didn’t get the kind of wanted performance when using HBase’s HTablePool to accommodate real-time streaming of data from a high-parallelism data dumping Storm bolt.

To dump rapid real-time streaming data into HBase, instead of HTablePool it might be more efficient to embed some queueing mechanism in the HBase storage module. An ex-colleague of mine, who is the architect at a VoIP service provider, employs the very mechanism in their production HBase database. Below is a simple implementation that has been tested performing well with a good-sized Storm topology. The code is rather self-explanatory. The HBaseStreamers class consists of a threaded inner class, Streamer, which maintains a queue of HBase Put using LinkedBlockingQueue. Key parameters are in the HBaseStreamers constructor argument list, including the ZooKeeper quorum, HBase table name, HTable auto-flush switch, number of streaming queues and streaming queue capacity.

Next, write a wrapper class similar to the following to isolate HBase specifics from the streaming application.

To test it with a distributed streaming application using Storm, write a bolt similar to the following skeleton. All that is needed is to initialize HBaseStreamers from within the bolt’s prepare() method and dump data to HBase from within bolt’s execute().

Finally, write a Storm spout to serve as the streaming data source and a Storm topology builder to put the spout and bolt together.

The parallelism/queue parameters are set to relatively small numbers in the above sample code. Once tested working, one can tweak all the various dials in accordance with the server cluster capacity. These dials include the following:

For simplicity, only HBase Put is being handled in the above implementation. It certainly can be expanded to handle also HBase Increment so as to carry out aggregation functions such as count. The primary goal of this Storm-to-HBase streaming exercise is to showcase the using of a module equipped with some “elasticity” by means of configurable queues. The queueing mechanism within HBaseStreamers provides cushioned funnels for the data streams and helps optimize the overall data intake bandwidth. Keep in mind, though, that doesn’t remove the need of administration work for a properly configured HBase-Hadoop system.

Programming Exercise – Sorting Algorithm

Unless algorithm development is part of the job, many software engineers use readily available algorithmic methods as needed and rarely need to develop algorithms themselves. I’m not talking about some complicated statistical or machine learning algorithms. Just simple mundane ones such as a sorting algorithm. Even if you don’t need to code algorithms, going back to writing a sorting program can still be a good exercise to review certain basic skills that might not be frequently used in your current day job. It’s a good-sized programming exercise that isn’t too trivial or taking up too much time. It also reminds you some clever (or dumb) tricks on how to perform sorting by means of recursive divide-and-merge, pivot partitioning, etc. And if nothing else, it might help you in your next technical job interview in the near future.

If you’re up for such an exercise, first, look up from Wikipedia or any other suitable source for a sorting algorithm (e.g. Merge Sort, Quick Sort) of your choice to re-familiarize yourself with its underlying mechanism. Next, decide on the scope of the application – for example, do you want an integer-sorting application or something more generic? … etc. Next, pseudo code, pick the programming language of your choice, and go for it.

Appended is a simple implementation of both Merge Sort and Quick Sort in Java. For the convenience of making method calls with varying data types and sorting algorithms, an interface (SimpleSort) and a wrapper (Sorter) are used. Java Generics is used to generalize the sorter for different data types. Adding Generics to a sorting application requires the using of either the Comparable or Comparator interface, as ordering is necessary in sorting. In this example application, the Comparable interface is used since the default ordering is good enough for basic sorting. The overall implementation code is pretty self explanatory.

SimpleSort.java

Sorter.java

MergeSort.java

QuickSort.java

SortingMain.java