
Startup Culture 2.0

Startup has been a household term since the early-to-mid '90s, when the World Wide Web began to take the world by storm. Triggered by Mosaic, the first graphical browser available on all popular platforms, the blossoming of the Web suddenly opened up all sorts of business opportunities, attracting entrepreneurs to try to capitalize on the newly born, eye-catching medium.

A historically well-known place for technology entrepreneurship, Silicon Valley (or, more broadly, the San Francisco Bay Area) became an even hotter spot for entrepreneurs to swarm to. Many of these entrepreneurs were young, energetic college graduates (or drop-outs) in science and engineering disciplines who were well equipped to quickly learn and apply new things in computing. They generally had a fast-paced work style with a can-do spirit. Along with the youthful work-hard-play-hard attitude, the so-called startup culture was born. Sun Microsystems was probably a great representative of companies embracing that very culture back in the dot-com days.

So, that was a brief, admittedly unofficial, history of the rise of startup culture 1.0.

Setting up an open-space engineering room

Setting up an engineering workspace

Version 2.0

I'm not another ex-dot-commer glorifying the good old startup culture of the dot-com days that later degenerated into the less commendable version observed today. I'm simply describing the gradual and, to some extent, subtle changes to the so-called startup culture that I've observed over the years.

Heading into startup culture 2.0, besides the emphasis on fast pace and can-do spirit, along came a number of phenomena including long hours, open space and the Agile "movement". Let's dive a bit deeper into each of them.

Long hours

In 1.0, we saw a lot of enthusiastic technologists voluntarily working long hours in the office. What's different now is that long hours are often an "involuntary" phenomenon. The mindset that software engineers should work long hours is so predominant that company management are often obsessed with the picture of all the techies diligently coding and brainstorming in the office. By pushing for long hours in the office, management are in essence commodifying software engineering into some hourly-paid kind of mechanical work. In many cases it's an unintentional act, but in others it's an act of distrust.

The fact is that serious software engineering requires serious brain-work. One can only deliver a limited number of hours of productive work on any given day. In other words, the actual amount of quality work does not increase beyond a few hours of serious brain-work. If you force engineers to pull long hours in the office beyond that productive limit, you're not increasing the work actually done; you're killing whatever stamina or incentive they have left to contribute bonus work when they aren't in the office. Even if they manage to overdraw themselves and deliver some extra work in a productivity-depleted state, they will most likely need to replenish their energy by taking time off the following day, which would otherwise be productive time.

Flexibility in work hours and locale

Back in 1.0, general Internet connection speeds were slow. It was common for a good-sized company with nationwide offices to share a T1 line as their Internet backbone, whereas today many residential consumers use connections an order of magnitude faster than a T1. So, in order to carry out productive work back then, people had to go to the office; oftentimes you could find software engineers literally camping there.

Given the vastly improved residential Internet infrastructure today, much of the engineering work that used to be doable only in the office can now be done at home. So, if a software engineer already has a regular office presence, there is little to no reason to demand long hours in the office. In fact, other than pre-scheduled group meetings and white-boarding sessions, engineers really don't have to stay in the office all the time, especially those who have to endure a bad commute.

Open space

Open office space has been advocated since the middle of the 1.0 era. Office cubes used to be 5-1/2 feet or taller in the old days. People later began to open up the visually cluttered office space by cutting about a foot off the cube walls, allowing individuals to see each other when standing up while keeping some privacy when sitting down. Personally, I think that's the optimal setting. But in recent years, lots of startups have adopted cubes with walls further lowered or removed entirely, essentially enforcing a multicast communication protocol on individuals at all times. The vast open view also ensures constant visual distraction.

Bear in mind that software engineers need good chunks of solitary time to do their coding work. Such a multicast, visually distracting setup isn't going to be productive for them. It's understandable for a cash-strapped early-stage startup to adopt a temporary, economical seating arrangement like that, but I've seen many well-funded companies build out such workspaces as their long-term work environment.

Agile

Agile software development is very popular these days; virtually every software engineering organization practices some sort of Agile process. There are already plenty of insights about the pros and cons of the Agile methodology out there, so I'm not going to get into them. Suffice it to say I like certain core Agile practices, such as the 2-week development sprint, the daily 15-minute scrum and continuous integration, which I think are applicable to many software development projects. However, I have reservations about mechanically adopting everything advocated in the Agile methodology. Which aspects of the Agile process an engineering organization adopts should be evaluated, carried out and adjusted in accordance with the engineers' skill sets, project type and operating environment. The primary goal should be efficiency and productivity, not adopting Agile because it sounds cool.

Insecurity and distrust?

Underneath the phenomena described above, I think there are some signs of insecurity and distrust when viewed from a certain angle.

For most startups every penny counts, and the high cost of keeping a team of software engineers is well known. Thus it's understandable that management tend to be nervous about whether their engineering hires deliver work in a timely manner. Long hours in the office help make management believe that things are moving along at above-average speed. Open space further helps overcome their seeing-is-believing insecurity. Frequent sprints and daily scrums make them feel that things are constantly being cranked out and that status visibility is at its maximum granularity.

If that's the general sentiment felt by the software engineers, most likely they won't be happy, and the most competent ones will likely be the first to flee. Nor will management be happy when they don't see the expected high productivity and find it hard to retain top engineers. The end result is that despite all the rallying about a fun, energetic startup culture on the company's hiring page, you'll hardly experience any fun or energy there.

What can be done better?

Management:

  1. Give your staff the benefit of the doubt — It's hard to let go of the doubt about whether people are working as hard as expected, but pushing for long hours in the office and keeping them visually exposed at their desks only sends a signal of distrust and insecurity. It will backfire and may simply result in a kind of conditional attendance where people show up only when they know you're around. I would also recommend making work hours as flexible as possible. On working remotely, a pre-agreed telecommuting arrangement goes a long way for those who must endure a horrible commute. People with enough self-respect appreciate demonstrated trust from management, and it makes them enjoy their job more.
     
  2. Work healthy — "Work hard, play hard" is almost a synonym for startup culture, to the point that we believe fun is a critical aspect of a startup environment. Yet throwing in a ping pong or foosball table does not automatically spawn a fun environment. In building out a work environment, I would argue that "work healthy" should perhaps replace "fun" as the primary initiative. A healthy working environment leads to happier staff and better productivity, and fun comes as a bonus. Common ways to achieve that include ergonomic office furniture, natural lighting, exercise facilities, workout subsidy programs, healthy snacks, or even a room for taking naps. Speaking of naps, it's worth serious consideration to embrace them as part of the culture: a 15-30 minute nap after lunch can do magic in refreshing one's mood and productivity for the remaining half of the day.
     
  3. Adopt Agile with agility — Take only what best suits your staff's skill sets, project type and operating environment. Stick to the primary goal of better productivity and efficiency, and add or remove Agile practices as needed. It's also important to regularly ask the staff for feedback and make adaptive changes to the practices when necessary.
     
  4. Make meetings worth attending — Meetings are common and important activities in pretty much every business, but oftentimes they aren't conducted efficiently. One common cause is packing too many agenda items into a single session, resulting in a lengthy and exhausting meeting. Another is a host who hasn't prepared well enough in advance. But even when the host has prepared, the meeting can still fail because the participants weren't asked ahead of time to prepare as well. This is particularly common when the meeting requires input from participants. With unprepared participants, the host will likely end up spending the most productive part of the meeting, when everyone's brain is still relatively fresh, explaining the obvious.
     
  5. Lead by example — Too often we see management hand down a set of rules to the staff while condescendingly assuming the rules don't apply to themselves. It's almost certain such rules will at best be followed superficially. Another common phenomenon is management rallying to create a culture that conflicts in many ways with their actual beliefs and style. They do it because they were told they must create a culture to run a startup operation, ignoring the fact that culture can only be built and fostered with themselves genuinely being a part of it. It cannot be fabricated.
     

Individuals:

  1. Honor the honor system — There may be various reasons for the commonly seen distrust by management. Unsurprisingly, one of them comes directly from individuals who abuse employee benefits meant to be used at their discretion. Perhaps the most common case is claiming the need to work from home with made-up reasons, or without actually putting in the hours. You can't blame people for distrusting you if you don't first honor the honor system. For instance, when you do work from home, stick to the actual meaning of work-from-home. Also, making your availability known to those who need to work closely with you helps. One effective way, especially for a relatively small team, is to have a shared group calendar designated for showing up-to-date team members' availability.
     
  2. Self-discipline — Again using work-from-home as an example, one needs the self-discipline to actually put in a decent number of hours of work. It's easy to be distracted, for instance by your kids, when working at home, but it's your own responsibility to make the necessary arrangements to minimize expected distractions. It's also your obligation to let your teammates know in advance when you will be unavailable for scheduled appointments or what-not.
     
  3. Reasonable work priority — When it comes to priority in life, nothing compares to one's family. Not even your job, as you can change jobs but not your family. So, for unplanned urgent family matters, no one will complain if you drop everything to take care of them. However, that doesn't justify frequently compromising your work attendance for things like picking up your kids from school, unless that's a pre-agreed work schedule. Bottom line: if your career matters to you, you shouldn't place your job responsibilities way down your priority list, below routine family errands.
     
  4. Active participation — Most software engineers hate meetings, feeling that they consume too much of the time that could otherwise be used for actually building the product. If the hosts do their job well in prepping for meetings and the participants also do their prep work on the agenda items, most meetings can benefit everyone without any feeling of wasted time. Attending a meeting without any prep work, carrying a feed-me mindset, will likely feel tedious and boring unless it's just a quick announcement-only meeting. When you've prepared before a meeting, chances are you will be a lot more engaged in the discussion and able to contribute valuable input. Such active participation stimulates collective creativity and fosters a culture of "best ideas win".
     
  5. Keep upgrading yourself — This may sound off-topic, but keeping yourself abreast of the knowledge and best practices in the core areas of your job responsibilities does help shape the team culture. Constant self-improvement naturally boosts one's confidence in one's own domain, which in turn facilitates efficient knowledge exchange and stimulates healthy competition. All that helps promote a high-efficiency, no-nonsense culture. The competitive aspect presents a healthy challenge to individuals, as long as excessive egos don't get in the way. On upgrading, between breadth and depth I would always lean toward depth. These days it's too easy to claim surface familiarity with all sorts of robust frameworks and libraries, but the most wanted technologists are often the ones who demonstrate in-depth knowledge, say, down to the code level of a software library.
     

Final thoughts

Startup culture 1.0 left us a signature work style many aspire to embrace. It has evolved over the years into a more contemporary 2.0 that better suits modern software development models in fast-changing, competitive spaces. But it's important that we don't take the hyped-up buzzwords at face value and mechanically treat them as gospel. The culture should be selectively embraced, adopted and adjusted in accordance with the team's strengths and weaknesses, project type, etc. More importantly, whatever culture is embraced should never be driven by distrust or insecurity.

Adopting Node.js In The Core Tech Stack

At DwellAware, a startup company I was with recently, I was tasked with building a web-centric application backed by comprehensive data analytics in the residential real estate space. Nevertheless, this post is not about the startup venture. It's about Node.js, the technology stack chosen to power the application. Programming platforms considered at the beginning of the venture included Scala/Play, PHP/Laravel, Python/Twisted, Ruby/Sinatra and Javascript/Node.js.

Nor is this a post comparing programming platforms. I'm simply going to state that Node.js was picked mainly for a few reasons:

  1. its lean-and-mean minimalist design principle is in line with how I would like to run things in general,
  2. its event-driven, non-blocking-I/O architecture is well suited for contemporary high-concurrency web-centric applications, and,
  3. keeping the entire web application on a single programming language, since contemporary client-side features are heavily and ubiquitously implemented in Javascript anyway.

Is adopting Node.js a justifiable risk?

In fact, that was the original title of the blog post. I was going to blog about the necessary research for adopting Node.js as the tech stack for the core web application back in 2013. It never grew to more than a few bullet points and was soon buried deep down the priority to-do list.

Javascript has been used on the client side of web applications for a long time. Handling non-blocking events triggered by human activity in a web browser is one thing; dealing with split-second server events and I/O activity in a non-blocking fashion on the server side is a little different. Still, Node.js's underlying event-driven, non-blocking architecture does help flatten the learning curve for Javascript developers.

Although new Node modules emerged almost daily to try to address just about anything in any problem space one could think of, not many of them proved to be very useful, let alone production-grade. That was two years ago. Admittedly, a lot has changed over the past couple of years and Node has definitely become more mature every day. By most standards, though, Node.js is still a relatively young technology.

Anyway, let’s rewind back to Fall 2013.

Built on Google's V8 Javascript engine, Node is a Javascript-based server platform designed to efficiently run I/O-intensive server applications. For a long time, Javascript was considered a client-side-only technology; Node.js has made it a serious contender on the server side. The fact that prominent software companies such as Microsoft, eBay and LinkedIn adopted Node.js in some of their products and services was more or less a testimonial. While hype about seemingly arbitrary technologies has always been a phenomenon in Silicon Valley, I wouldn't characterize the recent rise of Javascript and Node as mere hype.

Node.js modules

Node by itself is just a barebones server, hence picking suitable modules was one of the upfront tasks. One essential piece of Node's middleware ecosystem is Connect, which provides chaining of functions and enhances Node's http module. ExpressJS further equips Node with rich web-app features on top of Connect. To take advantage of multi-core/multi-processor server configurations, Node offers the method child_process.fork() for spawning worker processes that can communicate with their parent via built-in IPC (Inter-Process Communication).
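To make the multi-process setup concrete, here's a minimal sketch of forking a worker per CPU core and exchanging messages over the built-in IPC channel; the file names and message shapes are illustrative, not from the actual project:

```javascript
// parent.js -- fork one worker per CPU core (illustrative sketch)
var cp = require('child_process');
var os = require('os');

os.cpus().forEach(function (cpu, i) {
  var worker = cp.fork(__dirname + '/worker.js');  // spawn worker.js as a child Node process

  worker.on('message', function (msg) {            // replies arrive over the built-in IPC channel
    console.log('worker %d replied:', i, msg);
  });

  worker.send({ task: 'warm-up', worker: i });     // hand the worker a task
});
```

```javascript
// worker.js -- receive tasks from the parent and reply over IPC
process.on('message', function (msg) {
  process.send({ done: true, task: msg.task, pid: process.pid });
});
```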

On build tools, we started out with Grunt and later shifted to Gulp, partly for the speed of Gulp's streaming approach, though we were happy with Grunt as well. Express uses Jade as its default templating engine. We weren't happy with its performance, so we evaluated a couple of alternative templating engines, including doT.js and Swig, and were shocked to see performance gains of an order of magnitude. We promptly switched to Swig (with doT.js a close second).
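As a rough illustration of the switch, below is a minimal sketch of registering Swig as the Express view engine in place of Jade; the paths and option values are illustrative, not our actual configuration:

```javascript
var express = require('express');
var swig = require('swig');

var app = express();

// register Swig as the renderer for .html templates
app.engine('html', swig.renderFile);
app.set('view engine', 'html');
app.set('views', __dirname + '/views');

// during development, disable both Express' view cache and Swig's template cache
app.set('view cache', false);
swig.setDefaults({ cache: false });

app.get('/', function (req, res) {
  res.render('index', { title: 'Hello from Swig' });  // renders views/index.html
});

app.listen(3000);
```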

For the test framework, we used Mocha.js with the assertion library Chai.js, which supports the BDD (Behavior-Driven Development) assertion style.
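For flavor, here's a tiny example of that BDD style; the module under test (clampScore) is made up purely for illustration:

```javascript
// test/score.spec.js -- run with `mocha`
var expect = require('chai').expect;

// hypothetical function under test
function clampScore(score) {
  return Math.max(0, Math.min(100, score));
}

describe('clampScore()', function () {
  it('keeps scores within the 0-100 range', function () {
    expect(clampScore(150)).to.equal(100);
    expect(clampScore(-5)).to.equal(0);
  });

  it('returns in-range scores unchanged', function () {
    expect(clampScore(87)).to.equal(87);
  });
});
```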

Data persistence, caching, content delivery, etc

A key part of our product offering is data intelligence; thus databases for both OLTP and warehousing are critical components of the technology stack. MongoDB has been a default database choice for many Node.js applications for good reasons; the emerging MEAN (MongoDB-ExpressJS-AngularJS-Node.js) framework hints at the popularity of the Node-MongoDB combo. So Mongo was definitely considered. After careful consideration, however, we decided to go with MySQL. One consideration was that it wouldn't be too hard to hire a DBA/devops engineer with MySQL experience, given its popularity. Both Node.js and MongoDB were relatively new products and we didn't have in-house MongoDB expertise at the time, so taming one beast (Node, in this case) at a time was the preferred route.

There weren't many Node-MySQL modules out there, though we managed to adopt a simple MySQL module that also provides basic connection pooling. Later on, due to the superior geospatial functionality of PostGIS in the PostgreSQL ecosystem, we migrated from MySQL to PostgreSQL. Thanks to the vast Node.js module repository, Node-PostgreSQL modules with connection pooling were readily available. To cache frequently referenced application data, we used Redis as a centralized cache store.
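To give a feel for how those pieces fit together, here's a minimal read-through-cache sketch assuming the node-postgres (pg) connection pool and the classic callback-style redis client; the database, table and key names are hypothetical:

```javascript
var Pool = require('pg').Pool;
var redis = require('redis');

var pool = new Pool({ host: 'localhost', database: 'dwellaware', max: 10 });  // pooled PG connections
var cache = redis.createClient();                                             // centralized Redis cache

// read-through cache: try Redis first, fall back to PostgreSQL and cache the result
function getProperty(propertyId, callback) {
  var key = 'property:' + propertyId;
  cache.get(key, function (err, cached) {
    if (!err && cached) return callback(null, JSON.parse(cached));

    pool.query('SELECT * FROM properties WHERE id = $1', [propertyId], function (err, result) {
      if (err) return callback(err);
      var row = result.rows[0];
      cache.setex(key, 300, JSON.stringify(row));  // keep the cached copy for 5 minutes
      callback(null, row);
    });
  });
}
```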

Besides dynamic content rendered by the application, our web presence also included a lot of static content of various types, including images and certain client-side application data. To serve static content, we reviewed a few typical approaches, including a proxy web server and a content delivery network (CDN). On the proxy server front, Nginx has been on the rise, challenging Apache as the most popular HTTP server, and its minimalist design is rather like Node's. We did some load-testing of static content on Node, which turned out to be a rather efficient static content server, so we decided a proxy server wasn't necessary, at least in the immediate term. As for the CDN, we used Amazon's CloudFront.
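Serving static assets straight from Node is essentially a one-liner in Express; a hedged sketch of the kind of setup we load-tested might look like this (the directory and cache settings are illustrative):

```javascript
var express = require('express');
var app = express();

// serve images, client-side JS/CSS and other static assets directly from Node,
// with long cache headers so the CDN (e.g. CloudFront) and browsers can cache aggressively
app.use('/static', express.static(__dirname + '/public', {
  maxAge: '7d',   // Cache-Control max-age for static assets
  etag: true
}));

app.listen(3000);
```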

Score calculation & image processing

Part of the core value proposition of the product was to come up with objective scores for individual residential real estate properties and neighborhoods so as to help users make intelligent choices in buying or selling their homes. As described in a previous blog post, a lot of data science work across a wide spectrum of areas (cost analysis, crime, schools, comfort, noise, etc.) was performed to generate the scores.

Based on the computed scores, we then derived badges for qualified real estate properties in different areas (e.g. "Low Energy Bills", "Safe Neighborhood", "Top Rated Elementary School"). The badges were embedded in selected photos of individual properties, which could then be fed back into the listings distribution cycle by resubmitting them to the associated MLSes if the real estate agents/brokers chose to.

All the necessary score calculation and image processing for badges was done in the backend on a Python platform with PostgreSQL databases. Python Tornado servers were used as the data service API, with basic caching, for Node.js to consume data as presentation content.

Here’s a screen-shot of the Dwelling Page for a given real estate property, showing its DwellScore:

DwellAware DwellScore

Geospatial maps & search

For geographical maps and search, the Google Maps API was used extensively from within Node.js. We geocoded all real estate property addresses in advance using the API, as part of the backend data processing routine, so as to take advantage of Google's superior search capability.
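As a rough illustration of that batch geocoding step, here's a sketch that calls the Google Maps Geocoding web service from Node using only the built-in https module; the API key and the address are placeholders:

```javascript
var https = require('https');
var querystring = require('querystring');

// geocode a single address via the Google Maps Geocoding web service
function geocode(address, apiKey, callback) {
  var qs = querystring.stringify({ address: address, key: apiKey });
  https.get('https://maps.googleapis.com/maps/api/geocode/json?' + qs, function (res) {
    var body = '';
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () {
      var parsed = JSON.parse(body);
      if (parsed.status !== 'OK') return callback(new Error(parsed.status));
      callback(null, parsed.results[0].geometry.location);  // { lat: ..., lng: ... }
    });
  }).on('error', callback);
}

geocode('600 B St, San Diego, CA', 'YOUR_API_KEY', function (err, loc) {
  if (!err) console.log(loc.lat, loc.lng);
});
```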

To supplement the already robust Google Maps search from within Node.js and better utilize our own geospatial data content, we experimented with an Elasticsearch module, which comes with an n-gram lexical analyzer for fuzzy-match search. The test results were promising. An advantage of using such an autonomous search system is that it doesn't directly tax the Node.js server or the PostgreSQL database (e.g. pg_trgm) as traffic load increases.
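The gist of the experiment looked something like the sketch below, assuming the official Node Elasticsearch client and the mapping syntax of the 1.x/2.x-era Elasticsearch we evaluated; the index, type and field names are made up:

```javascript
var elasticsearch = require('elasticsearch');
var client = new elasticsearch.Client({ host: 'localhost:9200' });

// illustrative index: an edge n-gram analyzer for partial/fuzzy address matching
var addressIndex = {
  settings: {
    analysis: {
      filter: { address_ngram: { type: 'edge_ngram', min_gram: 2, max_gram: 15 } },
      analyzer: {
        address_analyzer: {
          type: 'custom',
          tokenizer: 'standard',
          filter: ['lowercase', 'address_ngram']
        }
      }
    }
  },
  mappings: {
    property: { properties: { address: { type: 'string', analyzer: 'address_analyzer' } } }
  }
};

client.indices.create({ index: 'properties', body: addressIndex })
  .then(function () {
    // partial, typo-tolerant match against the n-gram analyzed address field
    return client.search({
      index: 'properties',
      body: { query: { match: { address: { query: '123 main st san dieg', fuzziness: 'AUTO' } } } }
    });
  })
  .then(function (resp) { console.log(resp.hits.hits); });
```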

Below is a screen-shot of the Search Page centered on San Diego:

DwellAware Search

Fast-forward to the present

As mentioned earlier, Node.js has evolved quite a bit over the past couple of years — the rather significant feature and performance improvements from v0.10 to v0.12, the next LTS (long-term support) release incorporating the latest V8 Javascript engine and ES6 ECMAScript features, the io.js fork that later merged back into Node — all sound promising and exciting.

In conclusion, given the evident progress of Node’s development I’d say it’s now hardly a risk to adopt Node for building general web-centric applications, provided that your engineering team possesses sufficiently strong Javascript skills. It wasn’t a difficult decision for me two years ago to pick Node as the core technology stack, and would be an even easier one today.

For more screen-shots of the website, click here.

Yet Another Startup Venture

It has been a while since I published my last blog post. Over the past couple of years, I was busy working with a small team of entrepreneurs on a startup, DwellAware, in the residential real estate space. What we set out to build was a contemporary web application offering objective ratings for individual real estate properties, derived from a wide spectrum of data sources.

Throughout the course of the venture, we maintained a skeleton staff including the no-fear CEO, the product czar, a UX designer, a couple of web app/backend engineers, a data scientist, and the engineering head (myself). The office was located in the SoMa district of San Francisco. Competing for top talent in the SF Bay Area is always a challenge, but we were thrilled to have had some of the best talent forming the founding team.

MVP and product-market fit

Our initial focus was to build a minimum viable product (MVP) and go through rapid iterations to achieve product-market fit. To maximize the velocity of our MVP iterations, we started out with a selected region, San Diego County. We listened to user feedback regularly and iterated continuously accordingly. The feedback was diligently gathered through interviews with people in local coffee shops, online usability testing, and website activity analytics.

Eventually we arrived at a refined release and started to expand geographically from a single county to all of California. Waiting in the processing queue, ready to be deployed, were a number of states including Florida, Texas, New York and Illinois. The goal was to cover the 120+ million properties nationwide. We scaled the technology infrastructure as we expanded geographically and had all the key technology components in place.

Sadly, we couldn't quite make it to the finish line and had to wind down the operation. In hindsight, there were perhaps mistakes made at both the strategic and tactical levels that led to the disappointing result and would warrant some hard analysis. That isn't what this blog post is about, though. For now I would simply like to share some of the technology considerations and decisions made during the course of the venture.

DwellScore and HoodScore

A significant portion of the engineering work lay within the data science domain. In order to create an objective scoring system for individual properties across the nation that factors in hidden-cost analysis (e.g. commute, maintenance), we exhausted various data sources, from public census databases and open-source projects to commercial data providers, so as to establish a comprehensive data warehouse.

To help real estate agents/brokers promote their listings, we derived badges (e.g. "Safe Neighborhood", "Low Traffic Street", "Top Rated High School") and blended them into listing photos for qualified properties in accordance with the calculated scores. The agents/brokers were free to circulate selected badged photos by resubmitting them to the associated MLSes for on-going distribution.

One of the challenges was to validate and consolidate incomplete and sometimes inaccurate data from the various sources, which were oftentimes incompatible with one another. Even data acquired under expensive license terms was often found to be erroneous and incomplete. We got to the point where we were going to define our own nationwide neighborhood dataset in the next upgrade.

Nevertheless, we were able to come up with our first-generation scores for individual properties (DwellScore) and neighborhoods (HoodScore), backed by extensive data science work that aggregates sub-scores in areas of cost analysis, crime rate, school districts, neighborhood lifestyle and economics. Among the sub-scores was a comfort score with a number of unique ingredients, including noise. To come up with just the noise ranking, we had to comb through data and heat maps related to aircraft, railroad and road traffic counts, all from different sources.

The fact that a number of technology partners were interested in acquiring our data science work at the end of the venture does speak to its quality and comprehensiveness.

NLP & computer vision

Real estate listings have long been known for their lack of completeness and accuracy. There are hundreds of MLSes administered with disparate data management systems and possibly over a million real estate brokers/agents in the nation. As a result, listings data not only needs to be up-to-date, but should also be systematically validated in order to be trustworthy.

We experimented with using NLP (natural language processing) to help validate listings data by extracting and interpreting data of interest from the latest free-form text entered by agents. In addition, we worked with a computer vision company to process massive volumes of listing images via pattern recognition and machine learning. Certain characteristics of individual property listings, such as curb appeal, the ratio of actual living area to lot size, and the existence of power lines, could be identified through computer vision.

Technology stack

We adopted Node.js as the core tech stack for our web-centric application. Python was used as the backend/data-mining platform for data processing tasks such as importing real estate listings from MLSes, as well as for data-science number crunching. We also developed data service APIs for internal consumption using Tornado servers, to shield Node.js from having to handle data processing routines.

MySQL was initially chosen as the database management system for OLTP data storage and data warehousing. While Python has a rich set of libraries for geospatial/GIS (geographic information system) work, which constitutes a significant portion of our core development, on the database front it didn't take long for us to hit the limits of the geospatial capability offered by MySQL's latest stable release. PostgreSQL equipped with PostGIS has been the de facto database choice for most geospatialists in recent years. Even understanding that a database transition was going to cost us non-trivial effort, it was one of those uncompromisable actions we had to take. Switching the database platform was made easier by SQLAlchemy providing the ORM (object-relational mapping) abstraction layer on the Python side.

Geospatial search

The Google Maps API has great features for maps, street views and geocoded address search, but there were still cases where a separate custom search solution could complement the search functionality. PostgreSQL has a Trigram (pg_trgm) module which maintains trigram-based indexes over text columns for similarity search. That adds some crude natural-language capability to the search functionality, which is useful for more user-friendly geographical search (e.g. by property address).
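A minimal sketch of that approach, assuming the pg_trgm extension is installed and using the node-postgres pool; the table and column names are hypothetical:

```javascript
var Pool = require('pg').Pool;
var pool = new Pool({ host: 'localhost', database: 'dwellaware' });

// one-time setup, typically done in a migration:
//   CREATE EXTENSION IF NOT EXISTS pg_trgm;
//   CREATE INDEX idx_properties_address_trgm ON properties USING gin (address gin_trgm_ops);

// fuzzy address lookup ordered by trigram similarity
function searchAddress(text, callback) {
  pool.query(
    'SELECT address, similarity(address, $1) AS score ' +
    'FROM properties WHERE address % $1 ' +        // % is the pg_trgm similarity operator
    'ORDER BY score DESC LIMIT 10',
    [text],
    function (err, result) {
      callback(err, result && result.rows);
    }
  );
}
```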

While Postgres' Trigram module is a viable tool, it directly taxes the database and could impact performance as the database volume continues to grow. To scale search independently from the database, we picked Elasticsearch. Elasticsearch comes with a comprehensive set of functions for robust text search (partial match, fuzzy match, human language, synonym support, etc.) via its underlying n-gram lexical analyzer. In addition, it has basic functions for geolocation, supporting complex shapes in GeoJSON format. In brief, Elasticsearch appeared to fit our search requirements well.
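On the geolocation side, here is a hedged sketch of what a GeoJSON-backed query can look like with the Node Elasticsearch client; the index, type and field names are made up:

```javascript
var elasticsearch = require('elasticsearch');
var client = new elasticsearch.Client({ host: 'localhost:9200' });

// each neighborhood document carries its boundary as a GeoJSON geo_shape
client.indices.create({
  index: 'neighborhoods',
  body: {
    mappings: {
      neighborhood: {
        properties: {
          name: { type: 'string' },
          boundary: { type: 'geo_shape' }   // accepts GeoJSON polygons/multipolygons
        }
      }
    }
  }
}).then(function () {
  // find neighborhoods whose boundary contains a given point (longitude, latitude)
  return client.search({
    index: 'neighborhoods',
    body: {
      query: {
        geo_shape: {
          boundary: {
            relation: 'intersects',
            shape: { type: 'point', coordinates: [-117.1611, 32.7157] }  // downtown San Diego
          }
        }
      }
    }
  });
}).then(function (resp) {
  console.log(resp.hits.hits.length + ' matching neighborhoods');
});
```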

Cloud computing platform

We picked Amazon AWS as our hosting and cloud computing platform, so using its CloudFront as the CDN was a logical step. Other readily available AWS services also offer useful tools in various areas. On the operations front, Route 53 is a DNS service with a competitive edge over many existing services out there. For instance, it supports CNAME-like alias records for the base (apex) domain name, which many big-name DNS hosting services don't. Amazon's Elastic Load Balancer (ELB) also makes load-balancing setup easy and allows centralized digital certificate setup. With a wildcard certificate for the base domain name and a security policy that permits terminating SSL/TLS at the load balancer, secure website setup could be made really simple.

Security-wise, AWS now offers a rather high degree of flexibility for role-based security policies and security group setup. On the database front, Amazon's RDS provides a managed data persistence solution that shields one from having to build and maintain individual relational database servers. I had a lot of reservations when evaluating AWS security in a prior startup venture about its readiness to provide a production-grade infrastructure. I must say it has improved a great deal since.

A fun run

Although the venture lasted just slightly over two years, it was a fun run. We fostered a culture of transparency and best-ideas-win. We also embraced risk-taking and fast learning on many fronts, including adopting and picking up bleeding-edge technologies not entirely within our comfort zone. Below are a couple of pictures taken on the day the first production application was launched, back in the summer of 2014:

The crowd in the engineering room
Engineering room

Launching the first web site
Launching 1st web site

Challenges Of Big Data + SaaS + HAN

This is part two of a previous post about building and operating a Big Data SaaS for Home Area Network (HAN) devices during my 5-year tenure with EcoFactor. Simply put, our main goal was to add "smarts" to residential heating and cooling systems (i.e. heaters and air conditioners, a.k.a. HVAC) via ordinary thermostats. That focus led to a superficial perception by some people that we were a smart thermostat device company. In actuality, we were always a software service, virtually agnostic to both hardware and communications protocol. It's more of an IoT version of the "Intel Inside" business model.

Challenges from all fronts

As with building any startup company, a wide spectrum of challenges confronted us, and that is what this post is going to talk about. The funding environment was pretty hellish, as we started just shortly before the financial crisis of 2007-2008. And the failure of some high-profile solar companies in subsequent years certainly didn't help make the once-hyped cleantech a favorable sector for investors.

The ever-growing, fierce competition for software engineering talent was and has been a big challenge for pretty much every startup in Silicon Valley. On the technology front, production-grade open-source Big Data technologies weren't there yet, leading to the need for a lot of internal R&D by individual companies, which in turn required domain experts in both development and operations, a scarce, endangered species back then, thus completing the vicious circle that starts with the hiring difficulty.

Operational processes

On the operational front, there was a long list of processes that needed to be carefully established and managed, from user acquisition, on-boarding, device installer training, scheduling coordination for on-site device installation and technical support for installers, to customer service. Getting into the details of how all that was done would warrant a book. Scott Hublou, a co-founder of the company in charge of product and marketing, owned that "horrendous" list.

Many of the items on the list were correlated. For instance, getting HVAC technicians to create a HAN and pair thermostats with the HAN gateway during an on-site installation required not only a custom-built software tool with a well-thought-out workflow and easy UI, but also thorough training and a knowledgeable support team to back them up for ad-hoc troubleshooting.

Back on the engineering side of the world, a key piece of operations is the technology infrastructure, which needs to cope with future business growth. That includes systems hosting, network and data architecture, server clusters for distributed computing, load balancing, fail-over and monitoring mechanisms, firewalls, etc. As a startup, we started with something simple but expandable to conserve cash, and scaled up as quickly as necessary. That's also a practical approach from the design point of view to avoid over-engineering.

State of WPAN

On the hardware side, applicable HAN communications protocols and HAN device hardware were far from ready for mass deployment at the time we started exploring the space. That's a non-trivial challenge for anybody who wants to get into it. On the other hand, if done right, it represents an opportunity to pioneer in a relatively new arena.

ZigBee, an IEEE 802.15.4-based WPAN (Wireless Personal Area Network) protocol, was our selected communications protocol for scaled deployment. While it's a robust protocol compared with others such as Z-Wave, its specification was still undergoing changes and few real-world implementations had ever exploited its full features.

The protocol comes with a few predefined application profiles, including the Smart Energy and Home Automation profiles. Part of our core business is about translating HVAC operations data delivered via thermostats into actionable business intelligence, hence the ability to acquire key attributes from these devices was crucial. We quickly discovered that some attributes as basic as HVAC state were missing from certain application profiles, so we had to not only utilize multiple profiles but also extend to custom attributes in the ZCL (ZigBee Cluster Library).

Working with technology partners

Working with hardware technology partners presented some other challenges. HAN device firmware and embedded software development is a totally different beast from SaaS/server application development. Python on Linux is a prominent embedded software platform, and while that's also a popular combo for server software development, the two worlds bear little resemblance to each other. Building a system that bridges them takes learning and collaborative effort from both camps.

Some of our HAN device partners were quick to realize the significance of backing their gateway devices with a scalable PaaS infrastructure and invested significant effort in M2M (Machine-to-Machine) capabilities through acquisition and internal development. But coming from a hardware background, our hardware partners inevitably faced a non-trivial learning curve in areas such as software service scalability. Leveraging our internal scalable-SaaS development experience and our partners' embedded software engineering expertise, we managed to put together the best ingredients from both worlds in the cooperative work.

OTA firmware update

OTA (Over-the-Air) firmware update generally refers to wireless firmware update. Our devices ran on a WPAN protocol and the firmware was OTA-able. It's probably one of the operations that creates the most anxiety, as an update failure may result in "bricking" devices in volume, leading to the worst possible user experience. A bricked thermostat that results in an inoperable HVAC system (i.e. heater / air conditioner) would be the last thing a home occupant wants to deal with on a 105°F summer day, or worse, a potentially life-threatening hazard on a 10°F winter night.

This critical task is all about making sure the entire update procedure is foolproof from end to end. The important thing is to go through lots of rehearsals in advance. In addition, the capability to roll back the firmware version is as critical as the forward update, so that the update can be undone should unforeseen issues arise afterwards. Startups typically work at such a cut-throat pace that it's tempting to circumvent pre-production tests whenever possible, but this is one of those operations where even a minor compromise on stringent testing could mean the end of the business.

Pull vs Push

Around-the-clock time series data acquisition from a growing volume of primitive HAN devices is a capacity-intensive requirement. Understanding that it was going to be a temporary method for smaller-scale deployments, we started out using a simplistic pull model to mechanically acquire data from the HAN gateway devices. These devices gather data serially from their associated thermostats, making a single trip to a gateway-connected thermostat take anywhere from a few seconds to tens of seconds. To come up with a data acquisition method that could scale, we needed something at least an order of magnitude faster.

With larger-scale deployments in the pipeline, we didn't waste any time and worked collaboratively with all involved parties early on to build a scalable solution. We went back to the drawing board to scrutinize the various data communication methods supported by the WPAN specifications and laid out a few architectural changes. First, we switched the data acquisition model from pull to push. Such a change affected not only data communications within our internal SaaS applications but also the end-to-end data flow spanning our partners' PaaS systems.

One of the key changes was to come up with standards-compliant methods that minimize the necessary data retrievals via previously unexploited features such as attribute grouping and differential reporting under the push model. Attribute grouping allows selected attributes to be bundled into a single packet for delivery, instead of sending individual attributes serially in multiple deliveries. Differential reporting minimizes data deliveries by triggering a transfer only when at least one of the selected attributes has changed. All that meant lots of extra work for everybody in the short term, but in exchange for a scalable solution in the long run.
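The idea is easier to see in code than in prose. Below is an illustrative sketch in plain Javascript, not actual ZigBee/ZCL code, of how a gateway-side push with attribute grouping and differential reporting behaves; all names and attribute fields are made up:

```javascript
// the gateway remembers the last reported attribute group per thermostat and
// emits one bundled report only when at least one attribute has changed
var lastReported = {};

function maybeReport(thermostatId, attributes, send) {
  var prev = lastReported[thermostatId];
  var changed = !prev || Object.keys(attributes).some(function (key) {
    return attributes[key] !== prev[key];
  });

  if (!changed) return;                 // differential reporting: nothing changed, no traffic

  lastReported[thermostatId] = attributes;
  send({                                // attribute grouping: one packet for the whole group
    thermostat: thermostatId,
    reportedAt: Date.now(),
    attributes: attributes              // e.g. { setpoint, indoorTemp, hvacState }
  });
}

// usage: called whenever the gateway has fresh values from a thermostat
maybeReport('tstat-42', { setpoint: 72, indoorTemp: 74, hvacState: 'COOL' }, function (packet) {
  console.log('push to SaaS:', JSON.stringify(packet));
});
```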

Collaborative work pays off

The challenges mentioned above wouldn't have been resolvable had there not been a cross-functional team of technologists working diligently and creatively to make it happen. Performance was boosted by orders of magnitude after implementing the new data acquisition method. More importantly, the collective work in some ways set a standard for large-scale data acquisition from SaaS-managed HAN devices. It was an invaluable experience to be part of the endeavor.