It has been a while since I published my last blog post. Over the past couple of years, I was busy working with a small team of entrepreneurs on DwellAware, a startup in the residential real estate space. We set out to build a contemporary web application that offers objective ratings for individual real estate properties, derived from a wide spectrum of data sources.
Throughout the course of the startup venture, we maintained a skeletal staff including the no-fear CEO, the product czar, a UX designer, a couple of web app/backend engineers, a data scientist, and the engineering head (myself). The office was located in the SoMa district of San Francisco. Competing for top talent in the SF Bay Area is always a challenge, but we were thrilled to have some of the best talent forming the founding team.
MVP and product-market fit
Our initial focus was to build a minimum viable product (MVP) and go through rapid iterations to achieve product-market fit. To maximize the velocity of our MVP iterations, we started out with a single region, San Diego County. We gathered user feedback regularly and iterated continuously in response to it. The feedback was diligently collected through interviews with people in local coffee shops, online usability testing and website activity analytics.
Eventually we arrived at a refined release and started to expand geographically from a single county to all of California. Waiting in the processing queue, ready to be deployed, were a number of states including Florida, Texas, New York and Illinois. The goal was to cover the 120+ million properties nationwide. We scaled the technology infrastructure as we expanded the geography and had all the key technology components in place.
Sadly we couldn’t quite make it to the finish line and had to wind down the operation. In hindsight, perhaps there were mistakes made at both the strategic and tactical levels that led to the disappointing result and would warrant some hard analysis. That isn’t what this blog post is about. For now I would simply like to share some of the technological considerations and decisions made during the course of the venture.
DwellScore and HoodScore
A significant portion of the engineering work lay within the data science domain. To create an objective scoring system for individual properties across the nation that factors in hidden-cost analysis (e.g. commute, maintenance), we drew on a wide range of data sources, from public census databases and open-source projects to commercial data providers, to establish a comprehensive data warehouse.
To help real estate agents/brokers promote their listings, we derived badges (e.g. “Safe Neighborhood”, “Low Traffic Street”, “Top Rated High School”) from the calculated scores and blended them into the listing photos of qualifying properties. The agents/brokers were free to circulate selected badged photos by resubmitting them to the associated MLSes for ongoing distribution.
One of the challenges was to validate and consolidate incomplete and sometimes inaccurate data from various sources that were often incompatible with one another. Even data acquired under expensive license terms was often found to be erroneous and incomplete. We got to the point where we were going to define our own nationwide neighborhood dataset in the next upgrade.
Nevertheless, we were able to come up with our first-generation scores for individual properties (DwellScore) and neighborhoods (HoodScore), backed by some extensive data science work that aggregates sub-scores in the areas of cost analysis, crime rate, school districts, neighborhood lifestyle and economics. Among the sub-scores was a comfort score that includes a number of unique ingredients, noise among them. To come up with just the noise ranking, we had to comb through data and heat maps related to aircraft, railroad and road traffic counts, all from different sources.
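As a rough illustration of how sub-scores might roll up into a single rating, here is a minimal sketch of a weighted average. The sub-score names echo the areas listed above, but the weights, the 0-100 scale and the function name aggregate_score are assumptions made up for this example; this is not the actual DwellScore formula.

```python
# Hypothetical weights for illustration only; the real weighting is not given here.
WEIGHTS = {
    "cost": 0.30,
    "crime": 0.20,
    "schools": 0.20,
    "lifestyle": 0.15,
    "comfort": 0.15,
}


def aggregate_score(sub_scores, weights=WEIGHTS):
    """Combine 0-100 sub-scores into one 0-100 score via a weighted average.

    Sub-scores that are missing (absent or None) are skipped and the
    remaining weights are renormalized, so incomplete data still yields
    a score.
    """
    available = {k: w for k, w in weights.items() if sub_scores.get(k) is not None}
    total_weight = sum(available.values())
    if total_weight == 0:
        return None  # no usable sub-scores
    weighted = sum(sub_scores[k] * w for k, w in available.items())
    return round(weighted / total_weight, 1)
```

Renormalizing over the available weights is one possible way to cope with the incomplete source data mentioned earlier; a real system might instead refuse to score a property that lacks a key input.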
The fact that a number of technology partners were interested in acquiring our data science work at the end of the venture does speak to its quality and comprehensiveness.
NLP & computer vision
Real estate listings have long been known for their lack of completeness and accuracy. There are hundreds of MLSes administered using disparate data management systems and possibly over a million real estate brokers/agents in the nation. As a result, listings data not only needs to be up-to-date, but should also be systematically validated in order to be trustworthy.
We experimented with NLP (natural language processing) to help validate listings data by extracting and interpreting data of interest from the latest free-form text entered by agents. In addition, we worked with a computer vision company to process a massive volume of listing images via pattern recognition and machine learning. Certain characteristics of individual property listings, such as curb appeal, the ratio of actual living area to lot size, the existence of power lines, etc., could be identified through computer vision.
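To make the validation idea concrete, here is a minimal sketch that pulls a bedroom count out of free-form agent remarks and compares it with the structured field. It uses plain regular expressions as a stand-in for real NLP, and the field names (bedrooms, remarks) and function names are hypothetical, not taken from any MLS schema or from our codebase.

```python
import re

# Words agents commonly spell out instead of using digits.
_NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6}

# Matches e.g. "3 bedrooms", "three bed", "4BR", "2-bedroom".
_BEDROOM_RE = re.compile(
    r"\b(\d+|one|two|three|four|five|six)[\s-]*(?:bedrooms?|beds?|br)\b",
    re.IGNORECASE,
)


def extract_bedrooms(remarks):
    """Return the first bedroom count mentioned in the remarks, or None."""
    match = _BEDROOM_RE.search(remarks)
    if match is None:
        return None
    token = match.group(1).lower()
    return int(token) if token.isdigit() else _NUMBER_WORDS[token]


def bedrooms_mismatch(listing):
    """Flag a listing whose structured bedroom field disagrees with its remarks."""
    mentioned = extract_bedrooms(listing.get("remarks", ""))
    return mentioned is not None and mentioned != listing.get("bedrooms")
```

A listing with bedrooms set to 3 but remarks reading "Spacious 4 bed home" would be flagged for review, while a listing whose remarks never mention bedrooms is left alone.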
Technology stack
We adopted Node.js as the core tech stack for our web-centric application. Python was used as the backend/data-mining platform for data processing tasks such as importing real estate listings from MLSes, as well as for data-science number crunching. In addition, we developed data service APIs for internal consumption using Tornado servers, so that Node.js did not have to handle data processing routines.
MySQL was initially chosen as the database management system for OLTP data storage and data warehousing. Python has a rich set of libraries for geospatial/GIS (geographic information system) work, which constituted a significant portion of our core development. On the database front, however, it didn’t take long for us to hit the limit of the geospatial capability offered by MySQL’s latest stable release. PostgreSQL equipped with PostGIS has been the de facto database choice for most geospatialists in recent years. We understood that a database transition was going to cost us non-trivial effort, but it was one of those actions we simply had to take. Switching the database platform was made easier by SQLAlchemy, which provided the ORM (object-relational mapping) abstraction layer on the Python side.
Geospatial search
The Google Maps API has great features for maps, street views and geocoded address search, but there were still cases where a separate custom search solution could complement the search functionality. PostgreSQL has a trigram (pg_trgm) module that supports trigram-based indexes over text columns for similarity search. That helps add some crude NLP-like capability to the search functionality, which is necessary for more user-friendly geographical search (e.g. for a property address).
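To show what trigram similarity does with a misspelled address, here is a small pure-Python sketch that approximates pg_trgm's behavior: each word is lowercased, padded with two leading spaces and one trailing space, split into three-character windows, and two strings are compared by the share of trigrams they have in common. The function names and the sample addresses are made up for illustration, and the real module handles locale and character details this sketch ignores.

```python
import re


def trigrams(text):
    """Return the set of trigrams for text, mimicking pg_trgm's word padding."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams, from 0.0 to 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def best_matches(query, candidates, threshold=0.3):
    """Return (score, candidate) pairs at or above the threshold, best first."""
    scored = [(similarity(query, c), c) for c in candidates]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)
```

With this sketch, a query such as "123 Main Stret" still ranks "123 Main Street" well above the threshold, while an unrelated address shares no trigrams and drops out. The 0.3 default mirrors pg_trgm's default similarity threshold.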
While Postgres’ trigram module is a viable tool, it directly taxes the database and could impact performance as the database volume continues to grow. To scale up search independently from the database, we picked Elasticsearch. Elasticsearch comes with a comprehensive set of functions for robust text search (partial match, fuzzy match, human language, synonym support, etc.) via its configurable text analyzers, including n-gram tokenizers. In addition, it has basic functions for geolocation, supporting complex shapes in GeoJSON format. In brief, Elasticsearch appeared to fit our search requirements well.
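As an illustration of combining fuzzy text search with geolocation, here is a minimal sketch that builds an Elasticsearch query body as a plain Python dictionary. The field names address and location, the helper name build_address_query and the default radius are assumptions for this example, and the location field would need to be mapped as a geo_point for the filter to work.

```python
import json


def build_address_query(text, lat, lon, radius_km=5):
    """Build a query body: fuzzy address match limited to a radius around a point."""
    return {
        "query": {
            "bool": {
                # Tolerate small typos in the address text.
                "must": {
                    "match": {
                        "address": {"query": text, "fuzziness": "AUTO"}
                    }
                },
                # Keep only documents within radius_km of the given point.
                "filter": {
                    "geo_distance": {
                        "distance": f"{radius_km}km",
                        "location": {"lat": lat, "lon": lon},
                    }
                },
            }
        }
    }


def to_json(body):
    """Serialize the query body for sending to a search endpoint."""
    return json.dumps(body, indent=2)
```

The resulting dictionary can be serialized with to_json and sent to an index's search endpoint with whichever HTTP or Elasticsearch client is in use.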
Cloud computing platform
We picked Amazon AWS as our hosting and cloud computing platform, so using its CloudFront as the CDN was a logical step. Other readily available AWS services also offer useful tools in various areas. On the operations front, Route 53 is a DNS service that has some competitive edge over other existing services out there. For instance, it supports alias records, which act like a canonical name (CNAME) record for the base domain name, something many big-name DNS hosting services don’t offer. Amazon’s Elastic Load Balancer (ELB) also makes load-balancing setup easy and allows centralized digital certificate setup. With a wildcard digital certificate for a base domain name and a security policy that permits terminating SSL/TLS at the load balancer, setting up a secure website can be made quite simple.
Security-wise, AWS now offers a rather high degree of flexibility for role-based security policy and security group setup. On the database side, Amazon’s RDS provides a persistent data storage solution that shields one from having to build and maintain individual relational database servers. When I evaluated AWS security in a prior startup venture, I had a lot of reservations about its readiness to provide a production-grade infrastructure. I must say that it has improved a great deal since.
A fun run
Although the venture lasted just slightly over two years, it was a fun run. We fostered a culture of transparency and best-ideas-win. We also embraced risk-taking and fast learning on many fronts, including adopting and picking up bleeding-edge technologies not entirely within our comfort zone. Below are a couple of pictures taken on the day the first production application was launched back in the summer of 2014:
The crowd in the engineering room
Launching the first web site