Tag Archives: big data

Streaming Real-time Data Into HBase

Fast-write is generally a characteristic strength of distributed NoSQL databases such as HBase, Cassandra. Yet, for a distributed application that needs to capture rapid streams of data in a database, standard connection pooling provided by the database might not be up to the task. For instance, I didn’t get the kind of wanted performance when using HBase’s HTablePool to accommodate real-time streaming of data from a high-parallelism data dumping Storm bolt.

To dump rapid real-time streaming data into HBase, instead of HTablePool it might be more efficient to embed some queueing mechanism in the HBase storage module. An ex-colleague of mine, who is the architect at a VoIP service provider, employs the very mechanism in their production HBase database. Below is a simple implementation that has been tested performing well with a good-sized Storm topology. The code is rather self-explanatory. The HBaseStreamers class consists of a threaded inner class, Streamer, which maintains a queue of HBase Put using LinkedBlockingQueue. Key parameters are in the HBaseStreamers constructor argument list, including the ZooKeeper quorum, HBase table name, HTable auto-flush switch, number of streaming queues and streaming queue capacity.

package hbstream;

import java.util.UUID;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseStreamers {
    private Configuration hbaseConfig;
    private Streamer[] streamers;
    private boolean started = false;

    private class Streamer implements Runnable {
        private LinkedBlockingQueue<Put> queue;
        private HTable table;
        private String tableName;
        private int counter = 0;

        public Streamer(String tableName, boolean autoFlush, int capacity) throws Exception {
            table = new HTable(hbaseConfig, tableName);
            table.setAutoFlush(autoFlush);
            this.tableName = tableName;
            queue = new LinkedBlockingQueue<Put>(capacity);
        }

        public void run() {
            while (true) {
                try {
                    Put put = queue.take();
                    table.put(put);
                    counter++;
                }
                catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }

        public void write(Put put) throws Exception {
            queue.put(put);
        }

        public void flush() {
            if (!table.isAutoFlush()) {
                try {
                    table.flushCommits();
                }
                catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }

        public int size() {
            return queue.size();
        }

        public int counter() {
            return counter;
        }
    }

    public HBaseStreamers(String quorum, String port, String tableName, boolean autoFlush, int numOfStreamers, int capacity) throws Exception {
        hbaseConfig = HBaseConfiguration.create();
        hbaseConfig.set("hbase.zookeeper.quorum", quorum);
        hbaseConfig.set("hbase.zookeeper.property.clientPort", port);
        streamers = new Streamer[numOfStreamers];
        for (int i = 0; i < streamers.length; i++) {
            streamers[i] = new Streamer(tableName, autoFlush, capacity);
        }
    }

    public Runnable[] getStreamers() {
        return streamers;
    }

    public synchronized void start() {
        if (started) {
            return;
        }
        started = true;
        int count = 1;
        for (Streamer streamer : streamers) {
            new Thread(streamer, streamer.tableName + " HBStreamer " + count).start();
            count++;
        }
    }

    public void write(Put put) throws Exception {
        int i = (int) (System.currentTimeMillis() % streamers.length);
        streamers[i].write(put);
    }

    public void flush() {
        for (Streamer streamer : streamers) {
            streamer.flush();
        }
    }

    public int size() {
        int size = 0;
        for (Streamer st : streamers) {
            size += st.size();
        }
        return size;
    }

    public int counter() {
        int counter = 0;
        for (Streamer st : streamers) {
            counter += st.counter();
        }
        return counter;
    }
}

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

package hbstream;

import java.util.UUID;

import java.util.concurrent.LinkedBlockingQueue;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.hbase.HBaseConfiguration;

import org.apache.hadoop.hbase.client.HTable;

import org.apache.hadoop.hbase.client.Put;

import org.apache.hadoop.hbase.util.Bytes;

public class HBaseStreamers {

private Configuration hbaseConfig;

private Streamer[] streamers;

private boolean started = false;

private class Streamer implements Runnable {

private LinkedBlockingQueue<Put> queue;

private HTable table;

private String tableName;

private int counter = 0;

public Streamer(String tableName, boolean autoFlush, int capacity) throws Exception {

table = new HTable(hbaseConfig, tableName);

table.setAutoFlush(autoFlush);

this.tableName = tableName;

queue = new LinkedBlockingQueue<Put>(capacity);

}

public void run() {

while (true) {

try {

Put put = queue.take();

table.put(put);

counter++;

}

catch (Exception e) {

e.printStackTrace();

}

public void write(Put put) throws Exception {

queue.put(put);

}

public void flush() {

if (!table.isAutoFlush()) {

try {

table.flushCommits();

}

catch (Exception e) {

e.printStackTrace();

}

public int size() {

return queue.size();

}

public int counter() {

return counter;

}

public HBaseStreamers(String quorum, String port, String tableName, boolean autoFlush, int numOfStreamers, int capacity) throws Exception {

hbaseConfig = HBaseConfiguration.create();

hbaseConfig.set("hbase.zookeeper.quorum", quorum);

hbaseConfig.set("hbase.zookeeper.property.clientPort", port);

streamers = new Streamer[numOfStreamers];

for (int i = 0; i < streamers.length; i++) {

streamers[i] = new Streamer(tableName, autoFlush, capacity);

}

public Runnable[] getStreamers() {

return streamers;

}

public synchronized void start() {

if (started) {

return;

}

started = true;

int count = 1;

for (Streamer streamer : streamers) {

new Thread(streamer, streamer.tableName + " HBStreamer " + count).start();

count++;

}

public void write(Put put) throws Exception {

int i = (int) (System.currentTimeMillis() % streamers.length);

streamers[i].write(put);

}

public void flush() {

for (Streamer streamer : streamers) {

streamer.flush();

}

public int size() {

int size = 0;

for (Streamer st : streamers) {

size += st.size();

}

return size;

}

public int counter() {

int counter = 0;

for (Streamer st : streamers) {

counter += st.counter();

}

return counter;

}

Next, write a wrapper class similar to the following to isolate HBase specifics from the streaming application.

package hbstream;
....

public class StreamToHBase {
    private static final String tableName = "stormhbtest";
    private static final byte[] colFamily = Bytes.toBytes("data");
    private static final byte[] colQualifier = Bytes.toBytes("message");
    private static boolean isInit = false;
    private static HBaseStreamers hbStreamers = null;
    ....

    public static synchronized void init(String zkQuorum, String zkPort, boolean autoFlush, int numOfStreamers, int queueCapacity)
        throws Exception {

        if (isInit == true)
            return;
        isInit = true;
        HBaseStreamers streamers = new HBaseStreamers(zkQuorum, zkPort, tableName, autoFlush, numOfStreamers, queueCapacity);
        streamers.start();
        hbStreamers = streamers;
        ....
    }

    public static void writeMessage(String message) throws Exception {
        byte[] value = Bytes.toBytes(message);
        byte[] rowIdBytes = Bytes.toBytes(UUID.randomUUID().toString());
        Put p = new Put(rowIdBytes);
        p.add(colFamily, colQualifier, value);
        if (hbStreamers != null) {
            hbStreamers.write(p);
        }
    }
    ....
}

package hbstream;

....

public class StreamToHBase {

private static final String tableName = "stormhbtest";

private static final byte[] colFamily = Bytes.toBytes("data");

private static final byte[] colQualifier = Bytes.toBytes("message");

private static boolean isInit = false;

private static HBaseStreamers hbStreamers = null;

....

public static synchronized void init(String zkQuorum, String zkPort, boolean autoFlush, int numOfStreamers, int queueCapacity)

throws Exception {

if (isInit == true)

return;

isInit = true;

HBaseStreamers streamers = new HBaseStreamers(zkQuorum, zkPort, tableName, autoFlush, numOfStreamers, queueCapacity);

streamers.start();

hbStreamers = streamers;

....

}

public static void writeMessage(String message) throws Exception {

byte[] value = Bytes.toBytes(message);

byte[] rowIdBytes = Bytes.toBytes(UUID.randomUUID().toString());

Put p = new Put(rowIdBytes);

p.add(colFamily, colQualifier, value);

if (hbStreamers != null) {

hbStreamers.write(p);

}

....

}

To test it with a distributed streaming application using Storm, write a bolt similar to the following skeleton. All that is needed is to initialize HBaseStreamers from within the bolt’s prepare() method and dump data to HBase from within bolt’s execute().

package hbstream;
....

public class HBStreamTestBolt extends BaseRichBolt {
    OutputCollector _collector;
    ....

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        ....
        try {
            StreamToHBase.init("172.16.47.101, 172.16.47.102, 172.16.47.103", "2181", false, 4, 1000);
        }
        catch (Exception e) {
            ....
        }
        ....
    }

    @Override
    public void execute(Tuple tuple) {
        ....
        try {
            StreamToHBase.writeMessage(message);
        }
        catch (Exception e) {
            ....
        }
        ....
    }
    ....
}

package hbstream;

....

public class HBStreamTestBolt extends BaseRichBolt {

OutputCollector _collector;

....

@Override

public void prepare(Map conf, TopologyContext context, OutputCollector collector) {

....

try {

StreamToHBase.init("172.16.47.101, 172.16.47.102, 172.16.47.103", "2181", false, 4, 1000);

}

catch (Exception e) {

....

}

....

}

@Override

public void execute(Tuple tuple) {

....

try {

StreamToHBase.writeMessage(message);

}

catch (Exception e) {

....

}

....

}

....

}

Finally, write a Storm spout to serve as the streaming data source and a Storm topology builder to put the spout and bolt together.

package hbstream;
....

public class HBStreamTestSpout extends BaseRichSpout {
    SpoutOutputCollector _collector;
    ....

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        _collector = collector;
        words = new ArrayList<String>();
        ....
    }

    @Override
    public void nextTuple() {
        int rand = (int) (Math.random() * 1000);
        String word = words.get(rand % words.size());
        _collector.emit(new Values(word));
        ....
    }
    ....
}

package hbstream;

....

public class HBStreamTestSpout extends BaseRichSpout {

SpoutOutputCollector _collector;

....

@Override

public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {

_collector = collector;

words = new ArrayList<String>();

....

}

@Override

public void nextTuple() {

int rand = (int) (Math.random() * 1000);

String word = words.get(rand % words.size());

_collector.emit(new Values(word));

....

}

....

}

package hbstream;
....

public class HBStreamTopology {
    ....

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("testSpout", new HBStreamTestSpout(), 4);
        builder.setBolt("testBolt", new HBStreamTestBolt(), 6)
               .shuffleGrouping("testSpout");

        Config conf = new Config();
        conf.setNumWorkers(2);
        StormSubmitter.submitTopology("HBStreamTopology", conf, builder.createTopology());
        ....
    }
}

package hbstream;

....

public class HBStreamTopology {

....

public static void main(String[] args) throws Exception {

TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("testSpout", new HBStreamTestSpout(), 4);

builder.setBolt("testBolt", new HBStreamTestBolt(), 6)

.shuffleGrouping("testSpout");

Config conf = new Config();

conf.setNumWorkers(2);

StormSubmitter.submitTopology("HBStreamTopology", conf, builder.createTopology());

....

}

The parallelism/queue parameters are set to relatively small numbers in the above sample code. Once tested working, one can tweak all the various dials in accordance with the server cluster capacity. These dials include the following:

StreamToHBase.init():
    - boolean autoFlush
    - int numOfStreamers
    - int queueCapacity

TopologyBuilder.setSpout():
    - Number parallelismHint

TopologyBuilder.setBolt():
    - Number parallelismHint

Config.setNumWorkers():
    - int workers

StreamToHBase.init():

- boolean autoFlush

- int numOfStreamers

- int queueCapacity

TopologyBuilder.setSpout():

- Number parallelismHint

TopologyBuilder.setBolt():

- Number parallelismHint

Config.setNumWorkers():

- int workers

For simplicity, only HBase Put is being handled in the above implementation. It certainly can be expanded to handle also HBase Increment so as to carry out aggregation functions such as count. The primary goal of this Storm-to-HBase streaming exercise is to showcase the using of a module equipped with some “elasticity” by means of configurable queues. The queueing mechanism within HBaseStreamers provides cushioned funnels for the data streams and helps optimize the overall data intake bandwidth. Keep in mind, though, that doesn’t remove the need of administration work for a properly configured HBase-Hadoop system.

Real-time Big Data

1 Reply

Although demand for large scale distributed computing solutions has existed for decades, the term Big Data did not get a lot of public attention till Google published its data processing programming model, MapReduce, back in 2004. The Java-based Hadoop framework further popularized the term a couple of years later, partly due to the ubiquitous popularity of Java.

From Batch to Real-time

Hadoop has proven itself a great platform for running MapReduce applications on fault-tolerant HDFS clusters, typically composed of inexpensive servers. It does very well in the large-scale batch data processing problem domain. Adding a distributed database system such as HBase or Cassandra helps extend the platform to address the needs for real-time (or near-real-time) access to structured data. But in order to be able to use feature-rich messaging or streaming functionality, one will need some suitable systems that operate well on a distributed platform.

I remember feeling the need for such a real-time Big Data system when I was with a cleantech startup, EcoFactor, a couple of years ago seeking solutions to handle the increasingly demanding time-series data processing in a near-real-time fashion. It would have saved me and my team a lot of internal development work had such a system been available. One of my recent R&D projects after I left the company was to adopt suitable software components to address such real-time distributed data processing needs. The following highlights the key pieces I picked and what prompted me to pick them for the task.

Over the past couple of years, Kafka and Storm have emerged as two maturing distributed data processing systems. Kafka was developed by LinkedIn for high performance messaging, whereas Storm, developed by Twitter (through the acquisition of BackType), addresses the real-time streaming problem space. Although the two technologies were independently built, some real-time Big Data solution seekers see advantages bringing the two together by integrating Kafka as the source queue with a Storm streaming system. According to a published blog earlier this year, Twitter also uses the Kafka-Storm combo to handle its demanding real-time search.

High-performance Distributed Messaging

Kafka is a distributed publish-subscribe messaging system equipped with robust semantic partitioning and high messaging throughput by leveraging kernel-managed disk cache. High performance is evidently a key initiative in Kafka’s underlying architecture. It adopts the design principle that leverages kernel page caching to minimize data copying and context switching for higher messaging throughput. It also uses message grouping (MessageSet) to reduce network calls. At-least-once message processing is guaranteed. If exactly-once is a business requirement, one approach is to programmatically keep track of the messaging state by coupling the data with the message offset to eliminate duplication.

Kafka flows data by having the publishers push data to the brokers (Kafka servers) and subscribers pull from the brokers, giving the flexibility of a more diverse set of message consumers in the messaging system. It uses ZooKeeper for auto message broker discovery for non-static broker configurations. It also uses ZooKeeper to maintain message topics and partitions. Messages can be programmatically partitioned over a server cluster and consumed with ordering preserved within individual partitions. There are two APIs for the consumer – a high-level consumer (ConsumerConnector) that heavily leverages ZooKeeper to handle broker discovery, consumer rebalancing and message offset tracking; and a low-level consumer (SimpleConsumer) that allows users to manually customize all the key messaging features.

Setting up Kafka on a cluster is pretty straight forward. The version used in my project is 0.7.2. Each Kafka server is identified by a unique broker id. Each broker comes with tunable configurations on the socket server, logs and connections to a ZooKeeper ensemble (if enabled). There are also a handful of configurable properties that are producer-specific (e.g. producer.type, serializer.class, partitioner.class) and consumer-specific (e.g. autocommit.interval.ms).

The following links detail Kafka’s core design principles:
http://kafka.apache.org/07/design.html http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf

Real-time Distributed Streaming

Storm is a distributed streaming system that streams data through a customizable topology of data sources (spouts) and processors (bolts). It uses ZooKeeper, as well, to coordinate among the master and worker nodes in a cluster and manage transactional state as needed. Streams (composed of tuples), spouts and bolts constitute the core components of a Storm topology. A set of stream grouping methods are provided for partitioning a stream among consuming bolts in accordance with various use cases. The Storm topology executes across multiple configurable worker processes, each of which is a JVM. A Thrift-based topology builder is used to define and submit a topology.

On reliability, Storm guarantees that every tuple from a spout gets fully processed. It manages the complete lifecycle of each input tuple by tracking its id (msgId) and using methods ack(Object msgId) and fail(Object msgId) to report processing result. By anchoring the input tuple from the spout to every associated tuple being emitted in the consuming bolts, the spout can replay the tuple in the event of failure. This ensures at-least-once message processing.

Transactional Stream Processing

Storm’s transactional topology goes another step further to ensure exactly-once message processing. It processes tuples by batches, each identified by a transaction id which is maintained in a persistent storage. A transaction is composed of two phases – processing phase and commit phase. The processing phase allows batches to be proceeded in parallel (if the specific business logic warrants it), whereas the commit phase requires batches to be strongly ordered. For a given batch, any failure during the processing or commit phases will result in a replay of the entire transaction. To avoid over-update to a replayed batch, the current transaction id is examined against the stored transaction id within the strong-ordered commit phase and persisted update takes place only when the the id’s differ.

Then, there is this high-level abstraction layer, Trident API, on top of Storm that helps internalize some state management effort. It also introduces opaque transaction spouts to address failure cases in which loss of partial data source forbids replaying of the batch. It achieves such fault tolerance by maintaining in persistent storage a previous-state computed value (e.g. word count) in additional to the current computed value and transaction id. The idea is to reliably carry over partial value changes across strong-ordered batches, allowing the failed tuples in a partially failed batch to be processed in a subsequent batch.

Deploying Storm on a production cluster requires a little extra effort. The version, 0.8.1, used in my project requires a dated version of ZeroMQ – a native socket/messaging library in C++, which in turn needs JZMQ for Java binding. To build ZeroMQ, UUID library (libuuid-devel) is needed as well. Within the cluster, the master node runs the “Nimbus” daemon and each of the worker nodes runs a “Supervisor” daemon. It also comes with a administrative web UI that is handy for status monitoring.

The following links provide details on the topics of Storm’s message reliability, transactional topology and Trident’s essentials:
https://github.com/nathanmarz/storm/wiki/Guaranteeing-message-processing https://github.com/nathanmarz/storm/wiki/Transactional-topologies https://github.com/nathanmarz/storm/wiki/Trident-state

And, 1 + 1 = 1 …

Both Kafka and Storm are promising technologies in the real-time Big Data arena. While they work great individually, their functionalities also complement each other well. A commonly seen use case is to have Storm being the center piece of a streaming system with Kafka spouts providing queueing mechanism and data processing bolts carrying out specific business logic. If persistent storage is needed which is often the case, one can develop a bolt to persist data into a NoSQL database such as HBase or Cassandra.

These are all exciting technologies and are part of what makes contemporary open-source distributed computing software prosperous. Even though they’re promising, that doesn’t mean they’re suitable for every company that wants to run some real-time Big Data systems. At their current level of maturity, adopting them still requires some hands-on software technologists to objectively assess needs, design, implement and come up with infrastructure support plan.

Genuine Blog

A Tech Blog by Leo Cheung

Tag Archives: big data

Streaming Real-time Data Into HBase

Real-time Big Data

From Batch to Real-time

High-performance Distributed Messaging

Real-time Distributed Streaming

Transactional Stream Processing

And, 1 + 1 = 1 …