My previous blog post about real-time Big Data centers around some relevant open-source software (e.g. Storm, Kafka). This post shifts the focus towards reviewing its current state.
One thing the computing technology industry has never been starved of is the successive up and down of buzzwords – B2B, P2P, SOA, AOP, M2M, SaaS/PaaS, IOT, RWD (responsive web design), SDN (software-defined networking), … Recently, Big Data is one of the few that has taken the center stage.
How big is Big Data?
What is Big Data anyway? Typical structured data is in table format with columns and rows. For example, a dataset of 500,000 Web pages might be represented by 500,000 rows of data each with 3 columns of text: URL, page title, page content. In general, people use the term Big Data to represent data with large amount of columns and/or rows. But how big is big?
The “yield point” at which a contemporary RDBMS (relational database management system) can no longer perform well on decent server hardware is often considered the starting point for a Big Data system. That’s obviously a vague unscientific reference. In a recent startup operation, we maintained a pretty massive transactional RDBMS (with fail-over) on a couple of ordinary quad-core Xeon server boxes stuffed with a bunch of RAID 0+1 disks. There were a couple of optimally tuned transactional tables at 400+ million rows with actively used queries outer-joining them and the database performed just fine, showing no signs of yield any time soon. On the other hand, I had also seen ordinary queries bringing a database down to halt with transactional tables at just a few million rows.
Is Big Data for everyone?
Nevertheless, I’ve heard quite a few horror stories about companies delving into Big Data only to realize the extensive (read: expensive) R&D work was unjustified. Some grudgingly returned to the relational database model after pouring tons of resource into building a column-oriented distributed database system. It’s tempting to conclude that you need to immediately switch from RDBMS to column-oriented databases when a projection shows that your dataset will grow to 1 petabytes in 3 years. The conclusion may be flawed if the actual business requirement analysis isn’t thorough. For instance, it could be that:
- the dataset won’t reach anywhere near a small percentage of the petabyte scale for the first 2+ years
- data older than 3 months is not required to be in raw format and can be aggregated to only fractions of the original data volume
- the petabytes data size is due to certain huge data fields and actual row size is under tens of millions, which can be managed with a properly administered RDBMS
There are a lot of tech discussions about the pros and cons of relational databases versus column-oriented databases so I’m not going to repeat those arguments. It suffices to say that by switching from RDBMS to column-oriented databases, you’re trading away a whole bunch of good stuff that relational databases offer, for primarily high data capacity, fast write and built-in fault tolerance.
Adding real-time into the mix
Real-time is a term subject to contextual interpretation. In a more loose sense, response time in milliseconds to a few seconds is often regarded as real-time. As data volume increases, even such a loose requirement is no easy matter.
Let’s say it’s objectively determined that column-oriented database needs to be a part of your Big Data system, the next question is probably about how “real-time” you need the system to service data requests. Trying to make every bit of data in a Big Data system available for real-time (or near-real-time) random access is a difficult proposition. A more practical approach is to maintain a data warehouse with a set of updatable pre-computed views on all persisted data augmented by a real-time subsystem which provides access to the recently transacted data that hasn’t made it to the warehouse. The real-time subsystem will be kept relatively lean by regularly discarding data that has been secured in the warehouse.
The Lambda Architecture advocated by Nathan Marz (the creator of Storm) proposes a Big Data system composed of a batch and a real-time subsystems to cooperatively serve real-time queries across the entire persisted dataset. Based on a preview of the early-access-edition book by Marz, my understanding of the architecture is that it consists of:
- a Batch Layer that appends data to the immutable master dataset and continuously refreshes batch views (in the form of query functions) by recomputing arbitrary functions on the entire dataset
- a Serving Layer that processes the batch views and provides query service
- a Speed Layer that processes real-time views from newly acquired data and regularly rotates data off to the Batch Layer
Apparently, the architecture’s underlying design is oriented towards functional programming which is in principle rooted in Lambda Calculus. Under this computing paradigm, arbitrary data processing operations are expressed as compositions of functions which are program state-independent and operate on the entire immutable dataset.
The architecture also showcases the principle of separation of concern with each of the layers handling specific Big Data tasks it’s purposely designed for. The master dataset is maintained in the Batch Layer as append-only immutable raw data on a redundant distributed computing platform (e.g. Hadoop HDFS), allowing full data reprocessing in the event of major data processing errors. On the other hand, the Speed Layer would be better served by a real-time messaging or streaming system (e.g. Storm) backed by a random read-write capable persistent storage (e.g HBase). It’s an architecture that is elegant in principle and I look forward to seeing its final edition and real-world implementations.
Is real-time Big Data ripe for mainstream businesses?
Aside from distribution companies such as Cloudera, HortonWorks, there is a wide range of companies and startups building their entire business on providing Big Data service. Then there are these tech giants (e.g. EMC) which see Big Data a significant part of their strategic direction. As to the need for real-time, there has been debate on whether the actual demand is imminent for businesses, other than a handful of global real-time search/newsfeed services such as Twitter.
On one hand, a bunch of commercial products and open-source software frameworks have emerged to address the very need. On the other hand, businesses at large are still struggling to interpret the actual needs (i.e. how big and how real-time) by themselves and/or customers. Here’s one data point – I recently had a discussion with a founder of a Big Data platform provider who expressed skepticism about the imminent demand for real-time Big Data based on what he heard from his customers.
My own observation (admittedly with a limited scope) is that the real-time or near-real-time demand perhaps in an obscure fashion is already there today for many businesses. In other words, I think most Big Data companies already have something in their systems as part of their business requirement to address the real-time need to a range of extent. And I believe such observation isn’t skewed by my own career experience. If you’re maintaining a Big Data operation, chances are that you’re already implementing some sort of real-time subsystem.
Today, short of a robust customizable framework, many businesses take a relatively simpler approach to dump incoming data into a column-oriented database like HBase, perform filtering scans and ouput the selective data into a relational database for their real-time query need. Until a readily customizable framework with a robust underlying architecture like the Lambda Architecture is available, these businesses will have to continue to exhaust engineering resource to build their own real-time Big Data solutions.