Real-time Big Data Revisited

My previous blog post about real-time Big Data centers around some relevant open-source software (e.g. Storm, Kafka). This post shifts the focus towards reviewing its current state.

One thing the computing technology industry has never been starved of is the successive rise and fall of buzzwords – B2B, P2P, SOA, AOP, M2M, SaaS/PaaS, IoT, RWD (responsive web design), SDN (software-defined networking), … Recently, Big Data has been one of the few to take center stage.

How big is Big Data?

What is Big Data anyway? Typical structured data is in table format with columns and rows. For example, a dataset of 500,000 Web pages might be represented by 500,000 rows of data, each with 3 columns of text: URL, page title, page content. In general, people use the term Big Data to refer to data with a large number of columns and/or rows. But how big is big?

The “yield point” at which a contemporary RDBMS (relational database management system) can no longer perform well on decent server hardware is often considered the starting point for a Big Data system. That’s obviously a vague, unscientific reference. In a recent startup operation, we maintained a pretty massive transactional RDBMS (with fail-over) on a couple of ordinary quad-core Xeon server boxes stuffed with a bunch of RAID 0+1 disks. There were a couple of optimally tuned transactional tables at 400+ million rows with actively used queries outer-joining them, and the database performed just fine, showing no signs of yielding any time soon. On the other hand, I have also seen ordinary queries bring a database to a halt with transactional tables at just a few million rows.

Is Big Data for everyone?

Nevertheless, I’ve heard quite a few horror stories about companies delving into Big Data only to realize the extensive (read: expensive) R&D work was unjustified. Some grudgingly returned to the relational database model after pouring tons of resources into building a column-oriented distributed database system. It’s tempting to conclude that you need to immediately switch from RDBMS to column-oriented databases when a projection shows that your dataset will grow to 1 petabyte in 3 years. The conclusion may be flawed if the actual business requirement analysis isn’t thorough. For instance, it could be that:

  • the dataset won’t reach anywhere near a small percentage of the petabyte scale for the first 2+ years
  • data older than 3 months is not required to be in raw format and can be aggregated to only fractions of the original data volume
  • the petabyte-scale data size is due to certain huge data fields and the actual row count is under tens of millions, which can be managed with a properly administered RDBMS
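
A back-of-envelope check along the lines of the second point can be sketched in a few lines. All figures here are hypothetical: the 90-day raw retention window, the 2% aggregation ratio, and the simplifying assumption that the projected petabyte arrives evenly over the 3 years.

```python
def stored_volume_gb(daily_raw_gb, days, raw_retention_days=90, aggregate_ratio=0.02):
    """Estimate stored volume when only recent data is kept in raw form.

    Data older than raw_retention_days is assumed to be replaced by
    aggregates taking aggregate_ratio of the original volume.
    Both parameters are illustrative assumptions, not measured values.
    """
    raw_days = min(days, raw_retention_days)
    aggregated_days = max(days - raw_retention_days, 0)
    return daily_raw_gb * (raw_days + aggregated_days * aggregate_ratio)


# Hypothetical projection: ~1 PB (1,000,000 GB) of raw data over 3 years.
days = 3 * 365
daily_raw_gb = 1_000_000 / days

naive_gb = daily_raw_gb * days                            # keep everything raw
with_aggregation_gb = stored_volume_gb(daily_raw_gb, days)  # aggregate old data

print(f"raw only: {naive_gb:,.0f} GB, with aggregation: {with_aggregation_gb:,.0f} GB")
```

Under these assumed numbers the stored volume comes to roughly a tenth of the raw projection, which is the kind of difference that can change the database decision.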

There are a lot of tech discussions about the pros and cons of relational databases versus column-oriented databases, so I’m not going to repeat those arguments. Suffice it to say that by switching from RDBMS to column-oriented databases, you’re trading away a whole bunch of good stuff that relational databases offer, primarily for high data capacity, fast writes and built-in fault tolerance.

Adding real-time into the mix

Real-time is a term subject to contextual interpretation. In a looser sense, response time in milliseconds to a few seconds is often regarded as real-time. As data volume increases, even such a loose requirement is no easy matter.

Let’s say it’s objectively determined that a column-oriented database needs to be a part of your Big Data system. The next question is probably how “real-time” the system needs to be in servicing data requests. Trying to make every bit of data in a Big Data system available for real-time (or near-real-time) random access is a difficult proposition. A more practical approach is to maintain a data warehouse with a set of updatable pre-computed views on all persisted data, augmented by a real-time subsystem which provides access to the recently transacted data that hasn’t made it to the warehouse. The real-time subsystem will be kept relatively lean by regularly discarding data that has been secured in the warehouse.

Lambda Architecture

The Lambda Architecture advocated by Nathan Marz (the creator of Storm) proposes a Big Data system composed of a batch subsystem and a real-time subsystem that cooperatively serve real-time queries across the entire persisted dataset. Based on a preview of the early-access-edition book by Marz, my understanding of the architecture is that it consists of:

  • a Batch Layer that appends data to the immutable master dataset and continuously refreshes batch views (in the form of query functions) by recomputing arbitrary functions on the entire dataset
  • a Serving Layer that processes the batch views and provides query service
  • a Speed Layer that processes real-time views from newly acquired data and regularly rotates data off to the Batch Layer
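
To make the division of labor concrete, here is a minimal in-memory sketch of the three layers as described above, using page-view counting as the example query. It is only an illustration of the idea: the class and method names are hypothetical, everything runs in a single process, and it says nothing about how Marz's book or any real framework implements the layers.

```python
from collections import Counter


class LambdaSketch:
    """Toy, single-process illustration of batch, serving and speed layers."""

    def __init__(self):
        self.master = []             # append-only master dataset (Batch Layer)
        self.batch_view = Counter()  # precomputed view exposed by the Serving Layer
        self.speed_view = Counter()  # incremental view over recent data (Speed Layer)

    def ingest(self, url):
        # New data is appended to the master dataset and also reflected
        # immediately in the speed view.
        self.master.append(url)
        self.speed_view[url] += 1

    def run_batch(self):
        # Recompute the batch view from scratch over the entire master dataset,
        # then discard speed-layer data that the batch view now covers.
        self.batch_view = Counter(self.master)
        self.speed_view = Counter()

    def query(self, url):
        # A query merges the batch view with the real-time view.
        return self.batch_view[url] + self.speed_view[url]


system = LambdaSketch()
for page in ["/home", "/home", "/about"]:
    system.ingest(page)
before_batch = system.query("/home")   # served entirely from the speed view

system.run_batch()
system.ingest("/home")
after_batch = system.query("/home")    # batch view plus one recent event
```

Because the batch view is always recomputed from the immutable master list, a bug in the counting logic could be fixed and the view rebuilt without losing data, which is the property the architecture is built around.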

Apparently, the architecture’s underlying design is oriented towards functional programming which is in principle rooted in Lambda Calculus. Under this computing paradigm, arbitrary data processing operations are expressed as compositions of functions which are program state-independent and operate on the entire immutable dataset.

The architecture also showcases the principle of separation of concerns, with each of the layers handling the specific Big Data tasks it’s purposely designed for. The master dataset is maintained in the Batch Layer as append-only immutable raw data on a redundant distributed computing platform (e.g. Hadoop HDFS), allowing full data reprocessing in the event of major data processing errors. On the other hand, the Speed Layer would be better served by a real-time messaging or streaming system (e.g. Storm) backed by random read-write capable persistent storage (e.g. HBase). It’s an architecture that is elegant in principle and I look forward to seeing the book’s final edition and real-world implementations.

Is real-time Big Data ripe for mainstream businesses?

Aside from distribution companies such as Cloudera and Hortonworks, there is a wide range of companies and startups building their entire business on providing Big Data services. Then there are tech giants (e.g. EMC) which see Big Data as a significant part of their strategic direction. As to the need for real-time, there has been debate on whether the actual demand is imminent for businesses, other than a handful of global real-time search/newsfeed services such as Twitter.

On one hand, a bunch of commercial products and open-source software frameworks have emerged to address the very need. On the other hand, businesses at large are still struggling to interpret the actual needs (i.e. how big and how real-time) by themselves and/or customers. Here’s one data point – I recently had a discussion with a founder of a Big Data platform provider who expressed skepticism about the imminent demand for real-time Big Data based on what he heard from his customers.

Today, short of a robust industry-standard framework, many businesses take custom approaches: dump incoming data into a column-oriented database like HBase, perform filtering scans and output selected data into a relational database for their real-time query needs. Until a readily customizable framework with a robust underlying architecture like the Lambda Architecture is available, these businesses will have to continue to exhaust engineering resources to build their own real-time Big Data solutions.

Challenges Of Big Data + SaaS + HAN

This is part two of a previous post about building and operating a Big Data SaaS for Home Area Network devices during my 5-year tenure with EcoFactor. Simply put, our main goal was to add “smarts” to residential heating and cooling systems (i.e. heaters and air conditioners, a.k.a. HVAC) via ordinary thermostats. That focus led to a superficial perception by some people that we’re a smart thermostat device company. In actuality, we have always been a software service, virtually agnostic to both hardware and communications protocol. It’s more of an IoT version of the “Intel Inside” business model.

Challenges from all fronts

As with building any startup company, a wide spectrum of challenges confronted us, and they are what this post is about. The funding environment was pretty hellish, as we started just shortly before the financial crisis of 2007-2008. And the failure of some high-profile solar companies in subsequent years certainly didn’t help make the once-hyped cleantech a favorable sector for investors.

The ever-growing fierce competition for software engineering talent was and has been a big challenge for pretty much every startup in Silicon Valley. On the technology front, production-grade open-source Big Data technologies weren’t there, leading to the need for a lot of internal R&D effort by individual companies. That in turn required domain experts in both development and operations, who were a scarce, endangered species back then, thus completing the vicious loop that starts with the hiring difficulty.

Operational processes

On the operational front, there was a long list of processes that needed to be carefully established and managed, from user acquisition, on-boarding, device installer training, scheduling coordination for on-site device installation and technical support for installers, to customer service. Getting into the details of how all that was done would warrant writing a book. In charge of product and marketing, Scott Hublou, who is also a co-founder of the company, owned the “horrendous” list.

Many of the items in the list are interrelated. For instance, getting HVAC technicians to create a HAN network and pair up thermostats with the HAN gateway during an on-site installation not only required a custom-built software tool with a well-thought-out workflow and easy UI, but also thorough training and a knowledgeable support team to back them up for ad-hoc troubleshooting.

Back on the engineering side of the world, a key piece in operations is the technology infrastructure that needs to cope with future business growth. That includes systems hosting, network and data architecture, server clusters for distributed computing, load balancing systems, fail-over and monitoring mechanisms, firewalls, etc. As a startup company, we started with something simple but expandable to conserve cash, and scaled up as quickly as necessary. That’s also a practical approach from the design point of view to avoid over-engineering.

State of WPAN

On the hardware side, applicable HAN communications protocols and HAN device hardware were far from ready for mass deployment at the time when we started exploring that space. That’s a non-trivial challenge for anybody who wants to get into this space. On the other hand, if done right it represents an opportunity for one to pioneer in a relatively new arena.

ZigBee, a WPAN (Wireless Personal Area Network) protocol built on the IEEE 802.15.4 standard, was our selected communications protocol for scaled deployment. While it’s a robust protocol compared with others such as Z-Wave, its specification was still undergoing changes and few real-world implementations had ever exploited its full features.

The protocol comes with a few predefined application profiles including Energy Efficiency and Home Automation profiles. Part of our core business is about translating HVAC operations data via thermostats into actionable business intelligence, hence the ability to acquire key attributes from these devices is crucial. We quickly discovered that some attributes as basic as HVAC state were missing in certain application profiles, and we had to not only utilize multiple profiles but also resort to custom attributes in ZCL (ZigBee Cluster Library).

Working with technology partners

Working with hardware technology partners does present some other challenges. HAN device firmware and embedded software development is a totally different beast from SaaS/server application development. Python on Linux is a prominent embedded software platform. While that’s also a popular combo for server software development, the two worlds share little resemblance. Building a system that bridges the two worlds takes learning and collaborative effort from both camps.

Some of our HAN device partners were quick to realize the significance of the need to back their gateway devices with a scalable PaaS infrastructure and invested significant effort in M2M (Machine-to-Machine) through acquisition and internal development. But coming from a hardware background, there was inevitably a non-trivial learning curve for our hardware partners to get it right in areas such as software service scalability. Leveraging our internal scalable SaaS development experience and our partners’ embedded software engineering expertise, we managed to put together the best ingredients from both worlds into the cooperative work.

OTA firmware update

OTA (Over-the-Air) firmware update generally refers to wireless firmware update. Our devices run on a WPAN protocol and the firmware is OTA-able. It’s probably one of the operations that create the most anxiety, as an update failure may result in “bricking” the devices in volume, leading to the worst user experience. A bricked thermostat that results in an inoperable HVAC (i.e. heater / air conditioner) would be the last thing the home occupant wants to deal with on a 105°F summer day, or worse, a potentially life-threatening hazard on a 10°F winter night.

This critical task is all about making sure the entire update procedure is foolproof from end to end. The important thing is to go through lots of rehearsals in advance. In addition, the ability to roll back the firmware version is as critical as the forward update, so that the update can be undone should unforeseen issues arise afterwards. Startups typically work at such a cut-throat pace that it’s tempting to circumvent pre-production tests whenever possible. But this is one of those operations where even a minor compromise on stringent testing could mean the end of the business.
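
The update-then-rollback flow can be rehearsed against a simulated device long before any real hardware is touched. The sketch below is a toy illustration under that assumption: `SimulatedDevice`, `ota_update` and the health-check callback are hypothetical names, and no real thermostat, gateway or ZigBee OTA API is involved.

```python
import hashlib


class SimulatedDevice:
    """Hypothetical stand-in for a device that keeps its previous firmware."""

    def __init__(self, version, image):
        self.active = (version, image)
        self.previous = None

    def install(self, version, image):
        # Keep the currently running firmware so that it can be restored.
        self.previous = self.active
        self.active = (version, image)

    def rollback(self):
        if self.previous is not None:
            self.active, self.previous = self.previous, None


def ota_update(device, version, image, expected_sha256, health_check):
    """Install a firmware image, rolling back if the post-update check fails."""
    # Refuse to install an image that was corrupted in transit.
    if hashlib.sha256(image).hexdigest() != expected_sha256:
        return "rejected"
    device.install(version, image)
    if not health_check(device):
        device.rollback()
        return "rolled_back"
    return "updated"


old_image, new_image = b"firmware-1.0", b"firmware-1.1"
new_digest = hashlib.sha256(new_image).hexdigest()

healthy = SimulatedDevice("1.0", old_image)
result_ok = ota_update(healthy, "1.1", new_image, new_digest, lambda d: True)

faulty = SimulatedDevice("1.0", old_image)
result_bad = ota_update(faulty, "1.1", new_image, new_digest, lambda d: False)
```

Driving a simulation like this with deliberately failing health checks and corrupted images is one inexpensive way to rehearse the failure paths.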

Pull vs Push

The around-the-clock time series data acquisition from a growing volume of primitive HAN devices is a capacity-intensive requirement. Understanding that it was going to be a temporary method for smaller-scale deployments, we started out using a simplistic pull model to mechanically acquire data from the HAN gateway devices. These devices gather data serially from their associated thermostat devices, making a single trip to a gateway-connected thermostat device cost a few seconds to tens of seconds. To come up with a data acquisition method that could scale, we needed something that is at least an order of magnitude faster.
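
A rough calculation shows why serial polling becomes the bottleneck. The fleet size, per-trip latency and worker count below are made-up illustrative numbers, not figures from our deployment, and the function name is hypothetical.

```python
import math


def pull_sweep_seconds(gateways, thermostats_per_gateway, seconds_per_trip, parallel_workers):
    """Estimate the time for one full polling sweep under a pull model.

    Each gateway polls its thermostats serially; gateways themselves are
    polled in batches of parallel_workers at a time.
    """
    per_gateway = thermostats_per_gateway * seconds_per_trip
    batches = math.ceil(gateways / parallel_workers)
    return batches * per_gateway


# Hypothetical fleet: 10,000 gateways, 2 thermostats each, 5 s per trip,
# polled by 50 concurrent workers.
sweep = pull_sweep_seconds(10_000, 2, 5, 50)
print(f"one sweep takes about {sweep / 60:.0f} minutes")
```

With these assumed inputs a single sweep takes over half an hour, so the achievable sampling interval degrades linearly as the fleet grows unless the per-trip cost drops sharply.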

With larger-scale deployments in the pipeline, we didn’t waste any time and worked collaboratively with all involved parties early on to build a scalable solution. We went back to the drawing board to scrutinize the various data communication methods that are supported by the WPAN specifications and laid out a few architectural changes. First, we switched the data acquisition model from pull to push. Such change affected not only data communications within our internal SaaS applications but the end-to-end data flow spanning across our partners’ PaaS systems.

One of the key changes was to come up with standards-compliant methods that minimize necessary data retrievals via unexploited features such as attribute grouping and differential reporting under the push model. Attribute grouping allows selected attributes to be bundled as a single packet for delivery instead of spitting out individual attributes serially in multiple deliveries. Differential reporting helps minimize necessary data deliveries by triggering data transfer only when at least one of the selected attributes has changed. All that meant lots of extra work for everybody in the short term, but in exchange for a scalable solution in the long run.
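
The two ideas combine naturally, as the following sketch shows. It models the behavior in plain application code purely for illustration: the class, the attribute names and the `send` callback are hypothetical, and a real deployment would rely on the protocol's own reporting features rather than logic like this.

```python
class DifferentialReporter:
    """Pushes a bundled group of attributes only when one of them changes."""

    def __init__(self, attributes, send):
        self.attributes = tuple(attributes)  # the attribute group to bundle
        self.send = send                     # callback that delivers one packet
        self.last_reported = None

    def sample(self, readings):
        # Attribute grouping: collect all selected attributes into one payload.
        snapshot = {name: readings[name] for name in self.attributes}
        # Differential reporting: skip delivery when nothing has changed.
        if snapshot == self.last_reported:
            return False
        self.send(snapshot)
        self.last_reported = snapshot
        return True


delivered = []
reporter = DifferentialReporter(["hvac_state", "temperature"], delivered.append)

reporter.sample({"hvac_state": "off", "temperature": 71, "battery": 98})   # first report
reporter.sample({"hvac_state": "off", "temperature": 71, "battery": 97})   # unchanged group
reporter.sample({"hvac_state": "cool", "temperature": 71, "battery": 97})  # state changed
```

Three samples result in two deliveries here, each carrying the whole group in one payload; the `battery` reading is outside the selected group, so its change alone triggers nothing.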

Collaborative work pays off

The challenges mentioned above wouldn’t have been resolvable had there not been a cross-functional team of technologists working diligently and creatively to make it happen. Performance was boosted by orders of magnitude after implementing the new data acquisition method. More importantly, the collective work in some way set a standard for large-scale data acquisition from SaaS-managed HAN devices. It was an invaluable experience being a part of the endeavor.

Thoughts On Technology Marketing

In the commercial sector, many technology companies succeed in their space with average technology wrapped in a slick business strategy. Only a handful of them rise to the top by showcasing technology superior to their competitors’. It’s safe to assert that a viable business model coupled with the right execution remains the key driver of success.

Superior technology

Nevertheless, superiority in technology helps get a company ahead of its competitors. The lead may not persist forever, but while it lasts the fame of being the best helps seize significant market share. Every once in a while, we see products and services backed by superior technology win over customers en masse and crush competitors. Examples are Oracle’s database, Sun’s server hardware, Google’s search algorithm and Apple’s iPhone hardware/UI.

On the other hand, no matter how great your product is, competitors tend to catch up with comparable technologies quickly. Superiority in technology alone isn’t sufficient for success, but it does gain respect from the technology community, and a direct consequence is that the company is likely to attract top technology talent. Conversely, being widely perceived as technologically inferior to its competitors would sooner or later cause the company to lose.

There are lots of brilliant technologists, so emergence of great technological work isn’t rare. What’s rare is right execution with the right timing. From time to time, we see great ideas implemented at the “wrong time” like:

  • Resource-intensive GUI on computers when CPU speed was at 5 MHz
  • Cloud-based systems with common Internet connection at 56 kbit/s

Why technology marketing?

We do not have much control over the right timing, which is often prone to subjective interpretation plus a bit of luck. The Semantic Web and the Internet of Things are two examples that forward-thinking technologists started advocating more than a decade ago, yet they are nowhere near being widely adopted in their supposedly ubiquitous form. We do however have control over how to capitalize on internally developed technologies beyond building products. One approach is to publish or open-source selected technologies. Intentional or not, this is a marketing effort. Below are a few bullet points highlighting some of the benefits:

  • Project an image of being a technology pioneer
  • Good Samaritan, giving back to the technology community
  • Benefit from the general collaborative open-source development effort
  • Attract top technologists
  • Get feedback for improvement from a broad technology community

The bottom line is that almost everything, especially in the commercial sector, needs some marketing effort to shine. Technology is no different. More importantly, marketing your product directly is inevitably met with normal skepticism as you’re supposed to talk up your own product, and the effect is short-lived as every business including your competitors is doing the same thing. Marketing the underlying technology of your product adds subtlety to the conventional product marketing effect which customers have long been numb to.

Publicizing technologies

Many technology companies have already been doing that:

  • Google published their Big Data work such as MapReduce and BigTable, released interface definition language Protocol Buffers, among many other things.
  • As another company that deals with data at the real Big Data level, Facebook gave out NoSQL database Cassandra, interface definition language Thrift, and Scribe for streamed data aggregation.
  • Yahoo still gets a lot of respect from the technology community, not because of their search engine, email service or media-popular CEO, but because of their relevance in the Big Data technology space, particularly Hadoop.
  • Twitter incubated distributed real-time streaming software, Storm.
  • LinkedIn created high-performance distributed messaging software, Kafka.
  • Netflix rolled out Java library, Curator, for ZooKeeper and a bunch of cloud-centric software.
  • Meanwhile, I’m not aware of any open-source contribution from Amazon, but the popularity of their EC2 platform made them a cloud service pioneer. The retail giant was hardly perceived as a leading tech company before they expanded into cloud services.

When not to publicize your technologies?

Publicizing your internally developed technologies isn’t necessarily a good move in all cases. It might not be a good idea to expose technologies to the public especially in the form of open-source if the technology:

  • consists of your core business intellectual property (i.e. secret sauce)
  • isn’t compliant with industry standards
  • doesn’t work well on contemporary open-source platforms
  • is just “yet another” ordinary implementation of certain technology
  • hasn’t been and won’t be used in some of your own products
  • isn’t polished enough to give out to external technologists

Like any marketing effort, technology marketing takes significant resources. That’s why companies that can afford to do it are in general well-established, with abundant engineering resources. However, even for smaller companies and startups, if there is marketable and shareable technological work along with the right expertise in-house, publicizing it is still worth serious consideration.