Tag Archives: big data

Real-time Big Data Revisited

My previous blog post about real-time Big Data centers around some relevant open-source software (e.g. Storm, Kafka). This post shifts the focus towards reviewing its current state.

One thing the computing technology industry has never been starved of is the successive up and down of buzzwords – B2B, P2P, SOA, AOP, M2M, SaaS/PaaS, IOT, RWD (responsive web design), SDN (software-defined networking), … Recently, Big Data is one of the few that has taken the center stage.

How big is Big Data?

What is Big Data anyway? Typical structured data is in table format with columns and rows. For example, a dataset of 500,000 Web pages might be represented by 500,000 rows of data each with 3 columns of text: URL, page title, page content. In general, people use the term Big Data to represent data with large amount of columns and/or rows. But how big is big?

The “yield point” at which a contemporary RDBMS (relational database management system) can no longer perform well on decent server hardware is often considered the starting point for a Big Data system. That’s obviously a vague unscientific reference. In a recent startup operation, we maintained a pretty massive transactional RDBMS (with fail-over) on a couple of ordinary quad-core Xeon server boxes stuffed with a bunch of RAID 0+1 disks. There were a couple of optimally tuned transactional tables at 400+ million rows with actively used queries outer-joining them and the database performed just fine, showing no signs of yield any time soon. On the other hand, I had also seen ordinary queries bringing a database down to halt with transactional tables at just a few million rows.

Is Big Data for everyone?

Nevertheless, I’ve heard quite a few horror stories about companies delving into Big Data only to realize the extensive (read: expensive) R&D work was unjustified. Some grudgingly returned to the relational database model after pouring tons of resource into building a column-oriented distributed database system. It’s tempting to conclude that you need to immediately switch from RDBMS to column-oriented databases when a projection shows that your dataset will grow to 1 petabytes in 3 years. The conclusion may be flawed if the actual business requirement analysis isn’t thorough. For instance, it could be that:

  • the dataset won’t reach anywhere near a small percentage of the petabyte scale for the first 2+ years
  • data older than 3 months is not required to be in raw format and can be aggregated to only fractions of the original data volume
  • the petabytes data size is due to certain huge data fields and actual row size is under tens of millions, which can be managed with a properly administered RDBMS

There are a lot of tech discussions about the pros and cons of relational databases versus column-oriented databases so I’m not going to repeat those arguments. It suffices to say that by switching from RDBMS to column-oriented databases, you’re trading away a whole bunch of good stuff that relational databases offer, for primarily high data capacity, fast write and built-in fault tolerance.

Adding real-time into the mix

Real-time is a term subject to contextual interpretation. In a more loose sense, response time in milliseconds to a few seconds is often regarded as real-time. As data volume increases, even such a loose requirement is no easy matter.

Let’s say it’s objectively determined that column-oriented database needs to be a part of your Big Data system, the next question is probably about how “real-time” you need the system to service data requests. Trying to make every bit of data in a Big Data system available for real-time (or near-real-time) random access is a difficult proposition. A more practical approach is to maintain a data warehouse with a set of updatable pre-computed views on all persisted data augmented by a real-time subsystem which provides access to the recently transacted data that hasn’t made it to the warehouse. The real-time subsystem will be kept relatively lean by regularly discarding data that has been secured in the warehouse.

Lambda Architecture

The Lambda Architecture advocated by Nathan Marz (the creator of Storm) proposes a Big Data system composed of a batch and a real-time subsystems to cooperatively serve real-time queries across the entire persisted dataset. Based on a preview of the early-access-edition book by Marz, my understanding of the architecture is that it consists of:

  • a Batch Layer that appends data to the immutable master dataset and continuously refreshes batch views (in the form of query functions) by recomputing arbitrary functions on the entire dataset
  • a Serving Layer that processes the batch views and provides query service
  • a Speed Layer that processes real-time views from newly acquired data and regularly rotates data off to the Batch Layer

Apparently, the architecture’s underlying design is oriented towards functional programming which is in principle rooted in Lambda Calculus. Under this computing paradigm, arbitrary data processing operations are expressed as compositions of functions which are program state-independent and operate on the entire immutable dataset.

The architecture also showcases the principle of separation of concern with each of the layers handling specific Big Data tasks it’s purposely designed for. The master dataset is maintained in the Batch Layer as append-only immutable raw data on a redundant distributed computing platform (e.g. Hadoop HDFS), allowing full data reprocessing in the event of major data processing errors. On the other hand, the Speed Layer would be better served by a real-time messaging or streaming system (e.g. Storm) backed by a random read-write capable persistent storage (e.g HBase). It’s an architecture that is elegant in principle and I look forward to seeing its final edition and real-world implementations.

Is real-time Big Data ripe for mainstream businesses?

Aside from distribution companies such as Cloudera, HortonWorks, there is a wide range of companies and startups building their entire business on providing Big Data service. Then there are these tech giants (e.g. EMC) which see Big Data a significant part of their strategic direction. As to the need for real-time, there has been debate on whether the actual demand is imminent for businesses, other than a handful of global real-time search/newsfeed services such as Twitter.

On one hand, a bunch of commercial products and open-source software frameworks have emerged to address the very need. On the other hand, businesses at large are still struggling to interpret the actual needs (i.e. how big and how real-time) by themselves and/or customers. Here’s one data point – I recently had a discussion with a founder of a Big Data platform provider who expressed skepticism about the imminent demand for real-time Big Data based on what he heard from his customers.

My own observation (admittedly with a limited scope) is that the real-time or near-real-time demand perhaps in an obscure fashion is already there today for many businesses. In other words, I think most Big Data companies already have something in their systems as part of their business requirement to address the real-time need to a range of extent. And I believe such observation isn’t skewed by my own career experience. If you’re maintaining a Big Data operation, chances are that you’re already implementing some sort of real-time subsystem.

Today, short of a robust customizable framework, many businesses take a relatively simpler approach to dump incoming data into a column-oriented database like HBase, perform filtering scans and ouput the selective data into a relational database for their real-time query need. Until a readily customizable framework with a robust underlying architecture like the Lambda Architecture is available, these businesses will have to continue to exhaust engineering resource to build their own real-time Big Data solutions.

Challenges Of Big Data + SaaS + HAN

This is part two of a previous post about building and operating a Big Data SaaS for Home Area Network devices during my 5-year tenure with EcoFactor. Simply put, our main goal was to add “smarts” to residential heating and cooling systems (i.e. heaters and air conditioners, a.k.a. HVAC) via ordinary thermostats. That focus led to a superficial perception by some people that we’re a smart thermostat device company. In actuality, we have always been a software service, virtually agnostic to both hardware and communications protocol. It’s more of an IoT version of the “Intel Inside” business model.

Challenges from all fronts

Like building any startup company, there was a wide spectrum of challenges confronting us which is what this post is going to talk about. Funding environment was pretty hellish as we started just shortly before the financial crisis in 2007-2008. And failure of some high-profile solar companies in subsequent years certainly didn’t help make the once hyped cleantech a favorable sector for investors.

The ever-growing fierce competition for software engineering talent was and has been a big challenge for pretty much every startup in the Silicon Valley. On the technology front, production-grade open-source Big Data technologies weren’t there, leading to the need for a lot of internal R&D effort by individual companies, which in turn requires domain experts in both development and operations who were scarce endangered species back then, thus completing the vicious infinite loop that starts with the hiring difficulty.

Operational processes

On the operational front, there was a long list of processes that need to be carefully established and managed – from user acquisition, on-boarding, device installer training, scheduling coordination for on-site device installation, technical support for installers, to customer service. To get into the details of how all that was done warrants writing a book. In charge of product and marketing, Scott Hublou who is also a co-founder of the company owned the “horrendous” list.

Many of the items in the list are correlated. For instance, getting HVAC technicians to create a HAN network and pair up thermostats with the HAN gateway during an on-site installation not only required a custom-built software tool with a well-thoughtout workflow and easy UI, but also thorough training and a knowledgeable support team to back them up for ad-hoc troubleshooting.

Back to the engineering side of the world, a key piece in operations is the technology infrastructure that needs to cope with future business growth. That includes systems hosting, network and data architecture, server clusters for distributed computing, load balancing systems, fail-over and monitoring mechanism, firewalls, etc. As a startup company, we started with something simple but expandable to conserve cash, and scaled up as quickly as necessary. That’s also a practical approach from the design point of view to avoid over-engineering.

State of WPAN

On hardware, applicable HAN communications protocol and HAN device hardware were far from ready for mass deployment at the time when we started exploring in that space. That’s a non-trivial challenge for anybody who wants to get into the very space. On the other hand, if done right it represents an opportunity for one to pioneer in a relatively new arena.

ZigBee, an IEEE 802.15.4 standard WPAN (Wireless Personal Area Network) protocol, was our selected communications protocol for scaled deployment. While it’s a robust protocol compared with others such as Z-Wave, its specifications was still undergoing changes and few real-world implementations had ever exploited its full features.

The protocol comes with a few predefined application profiles including Energy Efficiency and Home Automation profiles. Part of our core business is about translating HVAC operations data via thermostats into actionable business intelligence, hence ability to acquire key attributes from these devices is crucial. We quickly discovered that some attributes as basic as HVAC state were missing in certain application profiles and we had to not only utilize multiple profiles but also extend to using custom attributes in ZCL (ZigBee Cluster Library).

Working with technology partners

Working with hardware technology partners does present some other challenges. HAN device firmware and embedded software development is a totally different beast from SaaS/server application development. Python on Linux is a prominent embedded software platform. While that’s also a popular combo for server software development, the two worlds share little resemblance. Building a system that bridges the two worlds takes learning and collaborative effort from both camps.

Some of our HAN device partners were quick to realize the significance of the need to back their gateway devices with a scalable PaaS infrastructure and invest significant effort in M2M (Machine-to-Machine) through acquisition and internal development. But coming from a hardware background, there was inevitably a non-trivial learning curve for our hardware partners to get it right in areas such as software service scalability. Leveraging our internal scalable SaaS development experience and our partners’ embedded software engineering expertise, we managed to put together the best ingredients from both worlds into the cooperative work.

OTA firmware update

OTA (Over-the-Air) firmware update generally refers to wireless firmware update. Our devices run on a WPAN protocol and the firmware is OTA-able. It’s probably one of the operations that create the most anxiety, as an update failure may result in “bricking” the devices in volume, leading to the worst user experience. A bricked thermostat that results in an inoperable HVAC (i.e. heater / air conditioner) would be the last thing the home occupant wants to deal with on a 105F Summer day, or worse, a potentially life-threatening hazard on a 10F Winter night.

This critical task is all about making sure the entire update procedure is foolproof from end to end. The important thing is to go through lots of rehearsals in advance. In addition, the capability of rollback of firmware version is as critical as the forward-update so to undo the update should unforeseen issues arise post-update. Startups typically work at a cut-throat pace that it’s tempting to circumvent pre-production tests whenever possible. But this is one of those operations that even a minor compromise of stringent tests could mean end of business.

Pull vs Push

The around-the-clock time series data acquisition from a growing volume of primitive HAN devices is a capacity-intensive requirement. Understanding that it was going to be a temporary method for smaller-scale deployments, we started out using a simplistic pull model to mechanically acquire data from the HAN gateway devices. These devices gather data serially from their associated thermostat devices, making a single trip to a gateway-connected thermostat device cost a few seconds to tens of seconds. To come up with a data acquisition method that could scale, we needed something that is at least an order of magnitude faster.

With larger-scale deployments in the pipeline, we didn’t waste any time and worked collaboratively with all involved parties early on to build a scalable solution. We went back to the drawing board to scrutinize the various data communication methods that are supported by the WPAN specifications and laid out a few architectural changes. First, we switched the data acquisition model from pull to push. Such change affected not only data communications within our internal SaaS applications but the end-to-end data flow spanning across our partners’ PaaS systems.

One of the key changes was to come up with standards compliant methods that minimize necessary data retrievals via unexploited features such as attribute grouping and differential reporting under the push model. Attribute grouping allows selected attributes to be bundled as a single packet for delivery instead of spitting individual attributes serially in multiple deliveries. Differential reporting helps minimize necessary data deliveries by triggering data transfer only when at least one of the selected attributes has changed. All that means lots of extra work for everybody in the short term, but in exchange for a scalable solution in the long run.

Collaborative work pays off

The challenges mentioned above wouldn’t be resolvable hadn’t there been a team of cross-functional group technologists working diligently and creatively to make it happen. Performance was boosted by orders of magnitude after implementing the new data acquisition method. More importantly, the collective work in some way set a standard for large-scale data acquisition from SaaS-managed HAN devices. It was an invaluable experience being a part of the endeavor.

Big Data SaaS For HAN Devices

At one of the startups I co-founded in recent years, I was responsible for building a SaaS (Software-as-a-Service) platform to manage a rapidly growing set of HAN (Home Area Network) devices across multiple geographies. It’s an interesting product belonging to the world of IoT (Internet of Things), a buzzword that wasn’t popular at all back in 2007. Building such a product required a lot of adventurous exploration and R&D effort from me and my team, especially back then when SaaS and HAN were generally perceived as two completely segregated worlds. The company is EcoFactor and is in the energy/cleantech space.

Our goal was to optimize residential home energy, particularly in the largely overlooked HVAC (heating, ventilation, and air conditioning) area. We were after the consumer market and chose to leverage channel partners in various verticals including HVAC service companies, broadband service providers, energy retailers, to reach mass customers. Main focus was twofold: energy efficiency and energy load shaping. Energy efficiency is all about saving energy while not significantly compromising comfort, and energy load shaping primarily targets utility companies who have vast interest in reducing spikes in energy load during usage peak-time.

Home energy efficiency

Energy efficiency implementation requires intelligence derived from mass real-world data and delivered by optimization algorithms. Proper execution of such optimization isn’t trivial. It involves deep learning of HVAC usage pattern in a given home, analysis of the building envelope (i.e. how well-insulated the building is), the users’ thermostat control activities, etc. All that information is deduced from the raw thermal data spit out by the thermostats, without needing to ask the users a single question. And execution is in the form of programmatic refinement through learning over time as well as interactive adjustment in accordance with feedback from ad-hoc activities.

Obviously, local weather condition and forecast information is another crucial input data source for executing the energy efficiency strategy. Besides temperature, other information such as solar/radiation condition and humidity are also important parameters. There are quite a lot of commercial weather datafeed services available for subscription, though one can also acquire raw data for U.S. directly from NCDC (National Climatic Data Center).

Energy load shaping

Many utilities offer demand response programs, often with incentives, to curtail energy consumption during usage peak-time (e.g. late afternoon on a hot Summer day). Load reduction in a conventional demand response program inevitably causes discomfort experienced by the home occupants, leading to high opt-out rate that beats the very purpose of the program. Since the “thermal signature” of individual homes is readily available from the vast thermal data being collected around the clock, it didn’t take too much effort to come up with suitable load shaping strategy, including pre-conditioning, for each home to take care of the comfort factor while conducting a demand response program. Utility companies loved the result.

HAN devices

The product functionality described so far seems to suggest that: a) some nonexistent complicated device communications protocol is needed, and, b) in-house hardware/firmware engineering effort is needed. Fortunately, there were already some WPAN (Wireless Personal Area Network) protocols, such as ZigBee/IEEE 802.15.4, Z-Wave, 6loWPAN (and other wireless protocols such as WiFi/IEEE 802.11.x), although implementations were still in experimentation at the time I started researching into that space.

We wanted to stay in the software space (more specifically, SaaS) and focus more on delivering business intelligence out of the collected data, hence we would do everything we could to keep our product hardware- and protocol-agnostic. Instead of trying to delve into the hardware engineering world ourselves, we sought and adopted strategic partnership with suitable hardware vendors and worked collaboratively with them to build quality hardware to match our functionality requirement.

Back in 2007, the WPAN-based devices available on the market were too immature to be used even for field trials, so we started out with some IP-based thermostats each of which equipped with a stripped-down HTTP server. Along with the manufacturer’s REST-based device access service, we had our first-generation 2-way communicating thermostats for proof-of-concept work. Field trials were conducted simultaneously in both Texas and Australia so as to capture both Summer and Winter data at the same time. The trials were a success. In particular, the trial result answered the few key hypotheses that were the backbone of our value proposition.

WPAN vs WiFi

To prepare ourselves for large-scale deployment, a low-cost barebone programmable thermostat one can find in a local hardware store like Home Depot is what we were going after as the base hardware. The remaining requirement would be to equip it with a low-cost chip that can communicate over some industry-standard protocol. An IP-based thermostat requiring running ethernet cable inside a house is out of question for both deployment cost and cosmetic reasons which we learned a great deal from our field trials. In essence, we only considered thermostats communicating over wireless protocols such as WPAN or WiFi.

Next, WPAN won over WiFi because of the relatively less work required for messing with the broadband network in individual homes and the low-power specs that works better for battery-powered thermostats. Finally, ZigBee became our choice for the first mass deployment because of its relatively robust application profiles tailored for energy efficiency and home automation. Another reason is that it was going to be the protocol SmartMeters would use, and communicating with SmartMeters for energy consumption information was in our product roadmap.

ZigBee forms a low-power wireless mesh network in which nodes relay communications. At 250 kbit/s, it isn’t a high-speed protocol and can operate in the 2.4GHz frequency band. It’s built on top of IEEE 802.15.4 and is equipped with industry-standard public-key cryptography security. Within a ZigBee network, a ZigBee gateway device typically serves as the network coordinator device, responsible for enforcing the security policy in the ZigBee network and enrollment of joining devices. It connects via ethernet cable or WiFi to a broadband router on one end and communicates wirelessly with the ZigBee devices in the home. The gateway device in essence is the conduit to the associated HAN devices. Broadband internet connectivity is how these HAN devices communicate with our SaaS platform in the cloud. This means that we only target homes with broadband internet service.

The SaaS platform

Our very first SaaS prototype system prior to VC funding was built on a LAMP platform using first-generation algorithms co-developed by a small group of physicists from academia. We later rebuilt the production version on the Java platform using a suite of open-source application servers and frameworks supplemented with algorithms written in Python. Heavy R&D of optimization strategy and machine learning algorithms were being performed by a dedicated taskforce and integrated into the “brain” of the SaaS platform. A suite of selected open-source software including application servers and frameworks were adopted along with tools for development, build automation, source control, integration and QA.

Relational databases were set up initially to persist acquired data from the HAN devices in homes across the nation (and beyond). The data acquisition architecture was later revamped to use HBase as a fast data dumping persistent store to accommodate the rapidly growing around-the-clock data stream. Only selected data sets were funneled to the relational databases for application logics requiring more complex CRUD (create, read, update and delete) operations. Demanding Big Data mining, aggregation and analytics tasks were performed on Hadoop/HDFS clusters.

Under the software-focused principle, our SaaS applications do not directly handle low-level communications with the gateway and thermostat devices. The selected gateway vendor provides its PaaS (Platform-as-a-Service) which takes care of M2M (machine to machine) hardware communications and exposes a set of APIs for basic device operations. The platform also maintains bidirectional communications with the gateway devices by means of periodic phone-home from devices and UDP source port keep-alive (a.k.a. hole-punching, for inbound communications through the firewall in a typical broadband router). Such separation of work allows us to focus on the high-level application logics and business intelligence. It also allows us to more easily extend our service to multiple hardware vendors.

Algorithms

Obviously I can’t get into any specifics of the algorithms which represents collective intellectual work developed and scrutinized by domain experts since the very beginning of the startup venture. It suffices to say that they constitute the brain of the SaaS application. Besides information garnered from historical data, the execution also takes into account of interactive feedback from the users (e.g. ad-hoc manual overrides of temperature settings on the thermostat via the up/down buttons or a mobile app for thermostat control) and modifies existing optimization strategy accordingly.

Lots of modeling and in-depth learning of real-world data were performed in the areas of thermal energy exchange in a building, HVAC run-time, thermostat temperature cycles, etc. A team of quants with strong background in Physics and numerical analysis were assembled to just focus on the relevant work. Besides custom optimization algorithms, machine learning algorithms including clustering analysis (e.g. k-Means Clustering) were employed for various kinds of tasks such as fault detection.

A good portion of the algorithmic programming work was done on the Python platform primarily for its abundance of contemporary Math libraries (SciPy, NumPy, etc). Other useful tools include R for programmatic statistical analysis and Matlab/Octave for modeling. For good reasons, the quant team is the group demanding the most computing resource from the Hadoop platform. And Hadoop’s streaming API makes it possible to maintain a hybrid of Java and Python applications. A Hadoop/HDFS cluster was used to accommodate all the massive data aggregation operations. On the other hand, a relational database with its schema optimized for the quant programs was built to handle real-time data processing, while a long-term solution using HBase was being developed.

Putting everything together

Although elastic cloud service such as Amazon’s EC2 has been hot and great for marketing, our around-the-clock data acquisition model consists of a predictable volume and steady stream rate. So the cloud’s elasticity wouldn’t benefit us much, but it’s useful for development work and benchmarking.

Another factor is security, which is one of the most critical requirements in operating an energy management business. A malicious attack that simultaneously switches on 100,000 A/Cs in a metropolitan region on a hot Summer day could easily bring down the entire grid. Cloud computing service tends to come with less flexible security measure, whereas one can more easily implement enhanced security in a conventional hosting environment, and co-located hosting would offer the highest flexibility in that regard. Thus a decision was made.

That pretty much covers all the key ingredients of this interesting product that brings together the disparate SaaS and HAN worlds at a Big Data scale. All in all, it was a rather ambitious endeavor on both the business and technology fronts, certainly not without challenges – which I’ll talk about perhaps in a separate post some other time.

Streaming Real-time Data Into HBase

Fast-write is generally a characteristic strength of distributed NoSQL databases such as HBase, Cassandra. Yet, for a distributed application that needs to capture rapid streams of data in a database, standard connection pooling provided by the database might not be up to the task. For instance, I didn’t get the kind of wanted performance when using HBase’s HTablePool to accommodate real-time streaming of data from a high-parallelism data dumping Storm bolt.

To dump rapid real-time streaming data into HBase, instead of HTablePool it might be more efficient to embed some queueing mechanism in the HBase storage module. An ex-colleague of mine, who is the architect at a VoIP service provider, employs the very mechanism in their production HBase database. Below is a simple implementation that has been tested performing well with a good-sized Storm topology. The code is rather self-explanatory. The HBaseStreamers class consists of a threaded inner class, Streamer, which maintains a queue of HBase Put using LinkedBlockingQueue. Key parameters are in the HBaseStreamers constructor argument list, including the ZooKeeper quorum, HBase table name, HTable auto-flush switch, number of streaming queues and streaming queue capacity.

Next, write a wrapper class similar to the following to isolate HBase specifics from the streaming application.

To test it with a distributed streaming application using Storm, write a bolt similar to the following skeleton. All that is needed is to initialize HBaseStreamers from within the bolt’s prepare() method and dump data to HBase from within bolt’s execute().

Finally, write a Storm spout to serve as the streaming data source and a Storm topology builder to put the spout and bolt together.

The parallelism/queue parameters are set to relatively small numbers in the above sample code. Once tested working, one can tweak all the various dials in accordance with the server cluster capacity. These dials include the following:

For simplicity, only HBase Put is being handled in the above implementation. It certainly can be expanded to handle also HBase Increment so as to carry out aggregation functions such as count. The primary goal of this Storm-to-HBase streaming exercise is to showcase the using of a module equipped with some “elasticity” by means of configurable queues. The queueing mechanism within HBaseStreamers provides cushioned funnels for the data streams and helps optimize the overall data intake bandwidth. Keep in mind, though, that doesn’t remove the need of administration work for a properly configured HBase-Hadoop system.