Our Journey Toward Powering Real Time Social Data Discovery

Posted on 17 Sep
Here at LeadSift, we are staunchly committed to providing actionable consumer insights in real time. Today’s consumers are sharing their preferences and taking action right now on social media, and we believe our customers should have access to that data as it is created.
Our commitment to real-time data means that we are constantly ingesting huge amounts of social data, processing that data, and enabling our customers to run customized queries. It was our goal to create a system where a customer could ask to see a list of moms living in New York who own a car and are considering a new one – and get results instantly.
This enormous amount of processed data presented a significant technical challenge: how could we build a system that handles real-time indexing and supports exploratory search on the fly? Since we wanted customers to be able to explore any arbitrary set of features in real time, we needed a robust, nimble solution.
We realized we needed to take a closer look at our database. Our system actively identifies thousands of consumer attributes for each lead that we process, meaning that we had to support aggregate queries that span several dimensions. And with over 50 million Twitter users stored already, our database had ballooned to a size that made traditional approaches impractical. As the amount of data continued to grow, we needed to move away from basic MongoDB queries, which, while flexible, had become a major bottleneck.
Our First Approach: Flatter Data
The first solution that we developed was to “flatten” our MongoDB documents into a format that contained just the bare minimum amount of data needed for our application. We compute and store dozens of attributes for each lead, but not all of that data is required to display a consumer profile in our user interface. By converting a full set of data into a smaller set, we were able to minimize the amount of data that our queries needed to retrieve from disk. This solution did improve query performance for a short period of time, and it cut down the per-document storage needs considerably. However, this performance jump was not long-lived. In less than a month it became obvious that query performance was unacceptable.
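The flattening step can be sketched roughly as follows. Note that the field names and document shape here are invented for illustration; they do not reflect our actual schema.

```python
# Hypothetical sketch of "flattening" a full lead document down to the
# minimal fields the UI needs. Field names are illustrative only.

def flatten(full_doc):
    """Project a full lead document to a minimal display document."""
    return {
        "_id": full_doc["_id"],
        "name": full_doc["profile"]["name"],
        "location": full_doc["profile"]["location"],
        # Keep only the attribute keys, dropping the supporting evidence.
        "attributes": sorted(full_doc["attributes"].keys()),
    }

full = {
    "_id": "lead-42",
    "profile": {"name": "Jane Doe", "location": "New York", "bio": "...", "avatar": "..."},
    "attributes": {
        "parent": {"confidence": 0.91, "evidence": ["tweet-1", "tweet-7"]},
        "car_owner": {"confidence": 0.88, "evidence": ["tweet-3"]},
    },
}

flat = flatten(full)
print(flat["attributes"])  # ['car_owner', 'parent']
```

The flattened documents would then replace the full documents in the query path, so each read pulls far fewer bytes from disk.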
Our Second Approach: Bitmasks
Not content with the performance of our “flattening” solution, we decided to try a new approach using Redis. Inspired by the experiences shared by companies like Spool [http://blog.getspool.com/2011/11/29/fast-easy-realtime-metrics-using-redis-bitmaps/] and Belly [https://tech.bellycard.com/blog/light-speed-analytics-with-redis-bitmaps-and-lua/], this approach used bitmasks to store each feature, and bitwise operations to retrieve a count for each feature in a query.
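The core idea can be sketched in plain Python, using arbitrary-precision integers in place of Redis bitmaps (the feature names and user indices below are invented for illustration):

```python
from collections import defaultdict

# Each feature maps to a bitmask; bit i is set if user i has that feature.
masks = defaultdict(int)

def set_feature(feature, user_index):
    masks[feature] |= 1 << user_index

# Index a few users (indices are arbitrary for this sketch).
for user in (0, 3, 5):
    set_feature("lives_in_ny", user)
for user in (3, 5):
    set_feature("owns_car", user)
set_feature("is_parent", 5)

def count(*features):
    """AND the feature bitmasks together and popcount the result."""
    combined = masks[features[0]]
    for feature in features[1:]:
        combined &= masks[feature]
    return bin(combined).count("1")

print(count("lives_in_ny", "owns_car"))               # 2
print(count("lives_in_ny", "owns_car", "is_parent"))  # 1
```

In Redis the same shape is expressed with SETBIT to index a user and BITOP AND plus BITCOUNT to answer a query, which is what makes the per-query work so cheap.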
For some time, we enjoyed the Redis solution. The real time exploration was much faster than before, serving up results in under a second. Unfortunately, we observed issues with concurrency in our Redis implementation. Since Redis is single-threaded (able to handle only one request at a time), slower queries or large-scale writes (from bulk-inserts, for example) caused the Redis server to become unresponsive when serving multiple connections. Also, in order to maximize the performance gains, we chose to write our implementation using Redis’ Lua scripting support. This added an extra layer of complexity to our solution, as we encountered subtle language issues with Lua that made development difficult.
Our Third Approach: Back To MongoDB
Since we had partial success with the bitmask approach, we decided to port the method from Redis to MongoDB, given that our team has more experience with that database. With this attempt, we simply stored each feature bitmask in a MongoDB document.
This resulted in much faster query times than the original MongoDB solution, though not as fast as Redis. Since bitmasks were stored in simple MongoDB documents, this solution’s complexity was much reduced compared to Redis, and the data itself was easier to manipulate.
Despite early signs that this would be a long-term solution, we frequently found that we were hitting MongoDB’s maximum document size of 16 MB in testing. We collect several thousand attributes for hundreds of thousands of users daily, and those bits added up quickly. There was no manageable way to get around this limit, and we had to scrap the approach.
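Some back-of-the-envelope arithmetic shows why the 16 MB cap bites so quickly at our scale. The figures below are illustrative, based on the rough numbers mentioned in this post rather than our exact production counts:

```python
# Why per-document bitmasks collide with MongoDB's 16 MB BSON limit.

MAX_DOC_BYTES = 16 * 1024 * 1024  # MongoDB's maximum document size

def mask_bytes(num_users):
    """One bit per user, rounded up to whole bytes."""
    return (num_users + 7) // 8

users = 50_000_000                      # "over 50 million Twitter users"
per_mask = mask_bytes(users)            # 6,250,000 bytes, roughly 6 MB per feature
masks_per_doc = MAX_DOC_BYTES // per_mask

print(per_mask, masks_per_doc)  # 6250000 2 -> only ~2 full masks fit in one document
```

With thousands of attributes, no sharding scheme for the masks felt manageable, which is what pushed us off this approach.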
Our Current Solution
Using ideas we gleaned from our past approaches, we have built an in-house “segments” server that we are currently using with great success. To start, we wrote scripts to convert our data into large Python dictionaries containing feature bitmasks and save them to pickled files on disk. Next, we wrote a simple server with Python that, on startup, reads the dictionaries from all of the data files into memory. Once the data has loaded, we are able to make requests to the server over the network using the ZeroMQ messaging library to perform any query that we need.
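A stripped-down sketch of the idea looks like this. The feature names, file layout, and query shape are invented for illustration, and the ZeroMQ request/reply layer is reduced to a comment so the sketch stays self-contained:

```python
import os
import pickle
import tempfile

# Offline conversion scripts produce dictionaries of feature -> bitmask,
# where bit i is set if user i has that feature, pickled to disk.
segments = {
    "lives_in_ny": 0b101001,
    "owns_car":    0b101000,
    "is_parent":   0b100000,
}

path = os.path.join(tempfile.mkdtemp(), "segments.pkl")
with open(path, "wb") as f:
    pickle.dump(segments, f)

# On startup, the server loads every pickled file into memory...
with open(path, "rb") as f:
    data = pickle.load(f)

# ...and then answers queries. In production this function would sit behind
# a ZeroMQ socket; here we call it directly in-process.
def query(features):
    """Intersect the requested feature bitmasks and return the user count."""
    combined = data[features[0]]
    for name in features[1:]:
        combined &= data[name]
    return bin(combined).count("1")

print(query(["lives_in_ny", "owns_car", "is_parent"]))  # 1
```

Because every mask lives in RAM, a query is just a handful of bitwise ANDs and a popcount, which is what keeps response times low even when aggregating across every attribute we collect.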
We have been using this system in production for over two months now, and we’re quite happy with it so far. It is extremely fast, taking ~200ms to query 3 months of data, or approximately 5 million consumers aggregated across all of the attributes that we collect. Thus far we haven’t experienced any serious issues, though we are keeping an eye on the server’s climbing RAM usage.
While indexing, processing and querying real time data is no easy task, we’re dedicated to providing our clients with fast and uninterrupted access to valuable insights about their customers. We will continue to fine-tune our solution so that it consistently delivers the most efficient possible experience for our users.