Tag Archives: nosql database

Relational Database Redemption

Relational databases, such as PostgreSQL and Oracle, can be traced back to the 80’s when they became a dominant type of data management systems. Their prominence was further secured by the ANSI standardization of the domain specific language called SQL (Structured Query Language). Since then, RDBMS (relational database management system) has been the de facto component most data-centric applications would be architecturally centered around.

What happened to relational Databases?

It’s a little troubling, though, over the past 10-15 years, I’ve witnessed relational databases being sidelined from the core functionality requirement review or architectural design in many software engineering projects that involve data-centric applications. In particular, other kinds of databases would often be favored for no good reasons. And when relational databases are part of the core technology stack, thorough data model design would often be skipped and using of SQL would often be avoided in cases where it would be highly efficient.

So, why have relational databases been treated with such noticeably less preference or seriousness? I believe a couple of causes have led to the phenomenon.

Object-oriented data persistence architecture

First, there was a shift in application architecture in late 90’s when object-oriented programming began to increasingly dominate in the computing world. In particular, the backend data persistence component of object-oriented applications began to take over the heavy lifting of the database CRUD (create/read/update/delete) operations which used to reside within the database tier via SQL or procedural language PL/SQL.

Java EJB (enterprise Java bean), which was aimed to emulate data persistence and query functionality among other things, took the object-oriented programming world by storm. ORM (object-relational mapping) then further helped keep software engineers completely inside the Object world. Realizing that the initial EJB specifications were over-engineered, it later evolved into JPA (Java Persistence API) which also incorporates ORM functionality. All that doesn’t eliminate the need of relational databases, but engineering design focus has since been pulled away from the database tier and SQL has been treated as if it was irrelevant.

NoSQL databases

Then, in late 00’s came column-oriented NoSQL databases like HBase and Cassandra, which were designed to primarily handle large-scale datasets. Designed to run on scalable distributed computing platforms, these databases are great for handling Big Data at the scale that conventional relational databases would have a hard time to perform well.

Meanwhile, document-based NoSQL databases like MongoDB also emerged and have increasingly been adopted as part of the core technology stack by software engineers. These NoSQL databases have all of a sudden stole the spotlight in the database world. Relational databases were further perceptually “demoted” and SQL wouldn’t look right without a negation prefix.

Object-oriented data persistence versus SQL, PL/SQL

Just to be clear, I’m not against having the data persistence layer of the application handle the business logic of data manipulations and queries within the Object world. In fact, I think it makes perfect sense to keep data access business logic within the application tier using the same object-oriented programming paradigm, shielding software engineers from having to directly deal with things in the disparate SQL world.

Another huge benefit of using the object-oriented data persistence is that it takes advantage of any scaling mechanism provided by the application servers (especially for those on distributed computing platforms), rather than, say, relying everything on database-resident PL/SQL procedures that don’t scale well.

What I’m against, though, is that proper design and usage best practices are skipped when a relational database is used, hallucinating that the ORM would just magically handle all the data manipulations/queries of a blob of poorly structured data. In addition, while ORMs can automatically generate SQLs for a relatively simple data model, they aren’t good at coming up with optimal efficient SQLs for many sophisticated models in the real world.

NoSQL databases versus Relational databases

Another clarification point I thought I should raise is that – I love both SQL-based relational and NoSQL databases, and have adopted them as core parts of different systems in the past. I believe they have their own sweet spots as well as drawbacks, and should be adopted in accordance with the specific need in data persistence and consumption.

I’ve seen some engineering organizations flocking to the NoSQL world for valid reasons, and others just for looking cool. I’ve also seen in a couple of occasions that companies decided to roll back from a NoSQL platform to using relational databases to better address their database transaction need after realizing that their increasing data volume demand can actually be handled fine with a properly designed relational database system.

In general, if your database need leans towards data warehousing and the projected data volume is huge, NoSQL is probably a great choice; otherwise, sticking to using relational databases might be the best deal. It all boils down to specific business requirement, and these days it’s also common that both database types are simultaneously adopted to complement each other. As to what’s considered huge, I would say it warrants a NoSQL database solution when one or more tables need to house 100’s of millions or more rows of data.

Why do relational databases still matter?

The answer to whether relational databases still matter is a decisive yes:

Real-world need of relational data models — A good portion of structured and inter-related data in the real world is still best represented by relational data models. While column-oriented databases excel in handling very large datasets, they aren’t designed for modeling relational data entities.
Transactional CRUD operations — Partly due to NoSQL database’s fundamental design, data often need to be stored in denormalized form for performance, and that makes transactional operations difficult. On the contrary, relational database is a much more suitable model for transactional CRUD operations that many types of applications require. That, coupled with the standard SQL language for transactional CRUD makes the role of relational databases not easily replaceable.
Bulk data manipulations — Besides proven a versatile powerful tool in handling transactional CRUD, SQL also excels in manipulating data in bulk without compromise in atomicity. While PL/SQL isn’t suitable for all kinds of data manipulation tasks, when used with caution it provides procedural functionality in bulk data processing or complex ETL (extract-transform-load).
Improved server hardware — Improvement in server processing power and low cost of memory and storage in recent years have helped make relational databases cope with the increasing demand of high data volume. On top of that, prominent database systems are equipped with robust data sharding and clustering features that also decidedly help in scalability. Relational databases with 10’s or even 100’s of million rows of data in a table aren’t uncommon these days.

Missing skills from today’s software architects

In recent years, I’ve encountered quite a few senior software engineers/architects with advanced programming skills but poor relational data modeling/SQL knowledge. With their computing backgound I believe many of these engineers could pick up the essential knowledge without too much effort. (That being said, I should add that while commanding the relational database fundamentals is rather trivial, becoming a database guru does require some decent effort.) It’s primarily the lack of drive to sharpen their skills in the specific domain that has led to the said phenomenon.

The task of database design still largely falls on the shoulders of the software architect. Most database administrators can configure database systems and fine-tune queries at the operational level to ensure the databases are optimally run, but few possess business requirement knowledge or, in many cases, skills for database design. Suitable database design and data modeling requires intimate knowledge and understanding of business logic of the entire application that is normally in the software architect’s arena.

Even in the NoSQL world of column-oriented databases, I’ve noticed that database design skills are also largely missing. Part of NoSQL database’s signature is that data columns don’t have to be well-defined upfront and can be added later as needed. Because of that, many software architects tend to think that they have the liberty to bypass proper schema design upfront. The truth is that NoSQL databases do need proper schema design as well. For instance, in HBase, due to the by-design limitation of indexing, one needs to carefully lay out upfront what the row key is comprised of and what column families will be maintained.

Old and monolithic?

Aside from causes related to the disruptive technologies described above, some misconceptions that associate relational databases with obsolete technology or monolithic design have also helped contribute to the unwarranted negative attitude towards RDBMS.

Old != Obsolete — Relational database technology is old. Fundamentally it hasn’t changed since decades ago, whereas new computing and data persistence technology buzzwords keep popping up left and right non-stopped. Given so many emerging technologies that one wants to learn all at once, old RDBMS often gets placed at the bottom of the queue. In any case, if a technology is old but continues to excel within its domain, it isn’t obsolete.

RDBMS != Monolith — Contemporary software architects have been advocating against monolithic design. In recent years, more and more applications have been designed and built as microservices with isolated autonomous services and data locality. That’s all great stuff in the ever-evolving software engineering landscape, but when people automatically categorize an application with a high-volume relational database a monolithic system, that’s a flawed assumption.

Bottom line, as long as much of the data in the real world is still best represented in relational data models, RDBMS will have its place in the computing world.

Streaming Real-time Data Into HBase

3 Replies

Fast-write is generally a characteristic strength of distributed NoSQL databases such as HBase, Cassandra. Yet, for a distributed application that needs to capture rapid streams of data in a database, standard connection pooling provided by the database might not be up to the task. For instance, I didn’t get the kind of wanted performance when using HBase’s HTablePool to accommodate real-time streaming of data from a high-parallelism data dumping Storm bolt.

To dump rapid real-time streaming data into HBase, instead of HTablePool it might be more efficient to embed some queueing mechanism in the HBase storage module. An ex-colleague of mine, who is the architect at a VoIP service provider, employs the very mechanism in their production HBase database. Below is a simple implementation that has been tested performing well with a good-sized Storm topology. The code is rather self-explanatory. The HBaseStreamers class consists of a threaded inner class, Streamer, which maintains a queue of HBase Put using LinkedBlockingQueue. Key parameters are in the HBaseStreamers constructor argument list, including the ZooKeeper quorum, HBase table name, HTable auto-flush switch, number of streaming queues and streaming queue capacity.

package hbstream;

import java.util.UUID;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseStreamers {
    private Configuration hbaseConfig;
    private Streamer[] streamers;
    private boolean started = false;

    private class Streamer implements Runnable {
        private LinkedBlockingQueue<Put> queue;
        private HTable table;
        private String tableName;
        private int counter = 0;

        public Streamer(String tableName, boolean autoFlush, int capacity) throws Exception {
            table = new HTable(hbaseConfig, tableName);
            table.setAutoFlush(autoFlush);
            this.tableName = tableName;
            queue = new LinkedBlockingQueue<Put>(capacity);
        }

        public void run() {
            while (true) {
                try {
                    Put put = queue.take();
                    table.put(put);
                    counter++;
                }
                catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }

        public void write(Put put) throws Exception {
            queue.put(put);
        }

        public void flush() {
            if (!table.isAutoFlush()) {
                try {
                    table.flushCommits();
                }
                catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }

        public int size() {
            return queue.size();
        }

        public int counter() {
            return counter;
        }
    }

    public HBaseStreamers(String quorum, String port, String tableName, boolean autoFlush, int numOfStreamers, int capacity) throws Exception {
        hbaseConfig = HBaseConfiguration.create();
        hbaseConfig.set("hbase.zookeeper.quorum", quorum);
        hbaseConfig.set("hbase.zookeeper.property.clientPort", port);
        streamers = new Streamer[numOfStreamers];
        for (int i = 0; i < streamers.length; i++) {
            streamers[i] = new Streamer(tableName, autoFlush, capacity);
        }
    }

    public Runnable[] getStreamers() {
        return streamers;
    }

    public synchronized void start() {
        if (started) {
            return;
        }
        started = true;
        int count = 1;
        for (Streamer streamer : streamers) {
            new Thread(streamer, streamer.tableName + " HBStreamer " + count).start();
            count++;
        }
    }

    public void write(Put put) throws Exception {
        int i = (int) (System.currentTimeMillis() % streamers.length);
        streamers[i].write(put);
    }

    public void flush() {
        for (Streamer streamer : streamers) {
            streamer.flush();
        }
    }

    public int size() {
        int size = 0;
        for (Streamer st : streamers) {
            size += st.size();
        }
        return size;
    }

    public int counter() {
        int counter = 0;
        for (Streamer st : streamers) {
            counter += st.counter();
        }
        return counter;
    }
}

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

package hbstream;

import java.util.UUID;

import java.util.concurrent.LinkedBlockingQueue;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.hbase.HBaseConfiguration;

import org.apache.hadoop.hbase.client.HTable;

import org.apache.hadoop.hbase.client.Put;

import org.apache.hadoop.hbase.util.Bytes;

public class HBaseStreamers {

private Configuration hbaseConfig;

private Streamer[] streamers;

private boolean started = false;

private class Streamer implements Runnable {

private LinkedBlockingQueue<Put> queue;

private HTable table;

private String tableName;

private int counter = 0;

public Streamer(String tableName, boolean autoFlush, int capacity) throws Exception {

table = new HTable(hbaseConfig, tableName);

table.setAutoFlush(autoFlush);

this.tableName = tableName;

queue = new LinkedBlockingQueue<Put>(capacity);

}

public void run() {

while (true) {

try {

Put put = queue.take();

table.put(put);

counter++;

}

catch (Exception e) {

e.printStackTrace();

}

public void write(Put put) throws Exception {

queue.put(put);

}

public void flush() {

if (!table.isAutoFlush()) {

try {

table.flushCommits();

}

catch (Exception e) {

e.printStackTrace();

}

public int size() {

return queue.size();

}

public int counter() {

return counter;

}

public HBaseStreamers(String quorum, String port, String tableName, boolean autoFlush, int numOfStreamers, int capacity) throws Exception {

hbaseConfig = HBaseConfiguration.create();

hbaseConfig.set("hbase.zookeeper.quorum", quorum);

hbaseConfig.set("hbase.zookeeper.property.clientPort", port);

streamers = new Streamer[numOfStreamers];

for (int i = 0; i < streamers.length; i++) {

streamers[i] = new Streamer(tableName, autoFlush, capacity);

}

public Runnable[] getStreamers() {

return streamers;

}

public synchronized void start() {

if (started) {

return;

}

started = true;

int count = 1;

for (Streamer streamer : streamers) {

new Thread(streamer, streamer.tableName + " HBStreamer " + count).start();

count++;

}

public void write(Put put) throws Exception {

int i = (int) (System.currentTimeMillis() % streamers.length);

streamers[i].write(put);

}

public void flush() {

for (Streamer streamer : streamers) {

streamer.flush();

}

public int size() {

int size = 0;

for (Streamer st : streamers) {

size += st.size();

}

return size;

}

public int counter() {

int counter = 0;

for (Streamer st : streamers) {

counter += st.counter();

}

return counter;

}

Next, write a wrapper class similar to the following to isolate HBase specifics from the streaming application.

package hbstream;
....

public class StreamToHBase {
    private static final String tableName = "stormhbtest";
    private static final byte[] colFamily = Bytes.toBytes("data");
    private static final byte[] colQualifier = Bytes.toBytes("message");
    private static boolean isInit = false;
    private static HBaseStreamers hbStreamers = null;
    ....

    public static synchronized void init(String zkQuorum, String zkPort, boolean autoFlush, int numOfStreamers, int queueCapacity)
        throws Exception {

        if (isInit == true)
            return;
        isInit = true;
        HBaseStreamers streamers = new HBaseStreamers(zkQuorum, zkPort, tableName, autoFlush, numOfStreamers, queueCapacity);
        streamers.start();
        hbStreamers = streamers;
        ....
    }

    public static void writeMessage(String message) throws Exception {
        byte[] value = Bytes.toBytes(message);
        byte[] rowIdBytes = Bytes.toBytes(UUID.randomUUID().toString());
        Put p = new Put(rowIdBytes);
        p.add(colFamily, colQualifier, value);
        if (hbStreamers != null) {
            hbStreamers.write(p);
        }
    }
    ....
}

package hbstream;

....

public class StreamToHBase {

private static final String tableName = "stormhbtest";

private static final byte[] colFamily = Bytes.toBytes("data");

private static final byte[] colQualifier = Bytes.toBytes("message");

private static boolean isInit = false;

private static HBaseStreamers hbStreamers = null;

....

public static synchronized void init(String zkQuorum, String zkPort, boolean autoFlush, int numOfStreamers, int queueCapacity)

throws Exception {

if (isInit == true)

return;

isInit = true;

HBaseStreamers streamers = new HBaseStreamers(zkQuorum, zkPort, tableName, autoFlush, numOfStreamers, queueCapacity);

streamers.start();

hbStreamers = streamers;

....

}

public static void writeMessage(String message) throws Exception {

byte[] value = Bytes.toBytes(message);

byte[] rowIdBytes = Bytes.toBytes(UUID.randomUUID().toString());

Put p = new Put(rowIdBytes);

p.add(colFamily, colQualifier, value);

if (hbStreamers != null) {

hbStreamers.write(p);

}

....

}

To test it with a distributed streaming application using Storm, write a bolt similar to the following skeleton. All that is needed is to initialize HBaseStreamers from within the bolt’s prepare() method and dump data to HBase from within bolt’s execute().

package hbstream;
....

public class HBStreamTestBolt extends BaseRichBolt {
    OutputCollector _collector;
    ....

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        ....
        try {
            StreamToHBase.init("172.16.47.101, 172.16.47.102, 172.16.47.103", "2181", false, 4, 1000);
        }
        catch (Exception e) {
            ....
        }
        ....
    }

    @Override
    public void execute(Tuple tuple) {
        ....
        try {
            StreamToHBase.writeMessage(message);
        }
        catch (Exception e) {
            ....
        }
        ....
    }
    ....
}

package hbstream;

....

public class HBStreamTestBolt extends BaseRichBolt {

OutputCollector _collector;

....

@Override

public void prepare(Map conf, TopologyContext context, OutputCollector collector) {

....

try {

StreamToHBase.init("172.16.47.101, 172.16.47.102, 172.16.47.103", "2181", false, 4, 1000);

}

catch (Exception e) {

....

}

....

}

@Override

public void execute(Tuple tuple) {

....

try {

StreamToHBase.writeMessage(message);

}

catch (Exception e) {

....

}

....

}

....

}

Finally, write a Storm spout to serve as the streaming data source and a Storm topology builder to put the spout and bolt together.

package hbstream;
....

public class HBStreamTestSpout extends BaseRichSpout {
    SpoutOutputCollector _collector;
    ....

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        _collector = collector;
        words = new ArrayList<String>();
        ....
    }

    @Override
    public void nextTuple() {
        int rand = (int) (Math.random() * 1000);
        String word = words.get(rand % words.size());
        _collector.emit(new Values(word));
        ....
    }
    ....
}

package hbstream;

....

public class HBStreamTestSpout extends BaseRichSpout {

SpoutOutputCollector _collector;

....

@Override

public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {

_collector = collector;

words = new ArrayList<String>();

....

}

@Override

public void nextTuple() {

int rand = (int) (Math.random() * 1000);

String word = words.get(rand % words.size());

_collector.emit(new Values(word));

....

}

....

}

package hbstream;
....

public class HBStreamTopology {
    ....

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("testSpout", new HBStreamTestSpout(), 4);
        builder.setBolt("testBolt", new HBStreamTestBolt(), 6)
               .shuffleGrouping("testSpout");

        Config conf = new Config();
        conf.setNumWorkers(2);
        StormSubmitter.submitTopology("HBStreamTopology", conf, builder.createTopology());
        ....
    }
}

package hbstream;

....

public class HBStreamTopology {

....

public static void main(String[] args) throws Exception {

TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("testSpout", new HBStreamTestSpout(), 4);

builder.setBolt("testBolt", new HBStreamTestBolt(), 6)

.shuffleGrouping("testSpout");

Config conf = new Config();

conf.setNumWorkers(2);

StormSubmitter.submitTopology("HBStreamTopology", conf, builder.createTopology());

....

}

The parallelism/queue parameters are set to relatively small numbers in the above sample code. Once tested working, one can tweak all the various dials in accordance with the server cluster capacity. These dials include the following:

StreamToHBase.init():
    - boolean autoFlush
    - int numOfStreamers
    - int queueCapacity

TopologyBuilder.setSpout():
    - Number parallelismHint

TopologyBuilder.setBolt():
    - Number parallelismHint

Config.setNumWorkers():
    - int workers

StreamToHBase.init():

- boolean autoFlush

- int numOfStreamers

- int queueCapacity

TopologyBuilder.setSpout():

- Number parallelismHint

TopologyBuilder.setBolt():

- Number parallelismHint

Config.setNumWorkers():

- int workers

For simplicity, only HBase Put is being handled in the above implementation. It certainly can be expanded to handle also HBase Increment so as to carry out aggregation functions such as count. The primary goal of this Storm-to-HBase streaming exercise is to showcase the using of a module equipped with some “elasticity” by means of configurable queues. The queueing mechanism within HBaseStreamers provides cushioned funnels for the data streams and helps optimize the overall data intake bandwidth. Keep in mind, though, that doesn’t remove the need of administration work for a properly configured HBase-Hadoop system.

Genuine Blog

A Tech Blog by Leo Cheung

Tag Archives: nosql database

Relational Database Redemption

What happened to relational Databases?

Object-oriented data persistence architecture

NoSQL databases

Object-oriented data persistence versus SQL, PL/SQL

NoSQL databases versus Relational databases

Why do relational databases still matter?

Missing skills from today’s software architects

Old and monolithic?

Streaming Real-time Data Into HBase