Heap Powers Advanced Web & Mobile Analytics on Massive Amounts of Data With Citus
~10 TB of memory across a 40-node Citus cluster
500+ billion events in a Citus database cluster
The challenge—finding the optimal database to enable scale and complex analytics
Heap automatically captures every user action on a company’s web, iOS, and Android applications (clicks, taps, swipes, form submissions, page views) for over 6,000 customers, and it lets those customers immediately query all of that information. Other similar solutions track only a handful of activities to avoid compromising performance. As a result, when questions arise about untracked activities, an engineer must first start logging the new activity, which adds cost and a lengthy delay while the data is collected.
Heap automatically captures all the data, all the time. The challenge with this approach is that Heap is collecting up to 100 times as much data as other solutions, resulting in a far more complex data infrastructure. For the first generation of Heap, the company used Amazon Redshift. The solution worked well but the Redshift data model was not optimal for the types of analyses that Heap wanted to perform. As a result, Heap could not scale its solution to accommodate the increasing number of websites and applications that Heap anticipated onboarding, and going after larger websites was out of the question.
Heap looked at a number of approaches to rewriting its solution and chose the Citus database because of its Postgres lineage and its ability to scale out easily across a cluster of commodity servers. Dan Robinson, CTO at Heap, was brought in to scale the company's infrastructure, and his first project was to move the product onto a Citus database backend. Today, all of Heap’s data for all of its customers lives in a Citus database cluster, which also powers all of Heap’s analytics.
“We are essentially a database company built around a Citus cluster. Without the right database in place, we would likely have failed,” said Robinson. “With Citus, we are able to rapidly scale our business, and the response of our customers has been fantastic.”
Heap and Citus Data: advanced analytics on huge amounts of data
Heap automatically captures every activity of every visitor and user of a customer’s web, iOS, and Android applications, so that Heap's customers can easily and immediately query the data. Citus has enabled Heap to offer customers very advanced query capabilities. For example, an analyst could define a cohort as “users who have uploaded a photo and have logged in three times in the last week and accepted a friend request in the last month,” and then filter a conversion funnel for people in that cohort.
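To make the cohort example concrete, here is a minimal Python sketch of that definition applied to raw events. It is an illustration only, not Heap's implementation: the event names, the `(user_id, event_name, timestamp)` layout, the `build_cohort` helper, and the reading of "three times" as "at least three times" are all assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def build_cohort(events, now):
    """Return users who uploaded a photo at any time, logged in at least
    three times in the last week, and accepted a friend request in the
    last month (taken here as 30 days)."""
    week_ago = now - timedelta(days=7)
    month_ago = now - timedelta(days=30)
    uploaded, accepted = set(), set()
    logins = defaultdict(int)  # user_id -> logins within the last week
    for user_id, name, ts in events:
        if name == "upload_photo":
            uploaded.add(user_id)
        elif name == "login" and ts >= week_ago:
            logins[user_id] += 1
        elif name == "accept_friend_request" and ts >= month_ago:
            accepted.add(user_id)
    frequent = {u for u, n in logins.items() if n >= 3}
    return uploaded & frequent & accepted


# Hypothetical sample data: alice meets all three conditions, bob does not.
now = datetime(2024, 1, 31)
events = [
    ("alice", "upload_photo", datetime(2023, 11, 2)),
    ("alice", "login", datetime(2024, 1, 26)),
    ("alice", "login", datetime(2024, 1, 28)),
    ("alice", "login", datetime(2024, 1, 30)),
    ("alice", "accept_friend_request", datetime(2024, 1, 10)),
    ("bob", "upload_photo", datetime(2024, 1, 29)),
    ("bob", "login", datetime(2024, 1, 30)),
    ("bob", "accept_friend_request", datetime(2024, 1, 15)),
]
cohort = build_cohort(events, now)

# Restrict a funnel's candidate users to members of the cohort.
funnel_users = ["alice", "bob", "carol"]
filtered = [u for u in funnel_users if u in cohort]
```

Because every event is already captured, a cohort like this can be evaluated retroactively, without first instrumenting the three activities.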
Both existing customers and prospects are responding to these capabilities with tremendous excitement. “Thanks to Citus, we’re powering a product that is really kind of magical for a lot of customers,” said Robinson. “The speed and performance of our database make it possible for them to immediately perform truly advanced analytics on any user activities they want to explore. All the data is already there.”
Cost-effective scale plus lightning-fast analytics
Citus has enabled Heap to cost-effectively scale its Postgres database horizontally across a cluster, which has now grown to 40 nodes that share approximately 10 TB of memory. Each commodity server in the Citus distributed database cluster has 16 CPU cores and 3.2 TB solid-state drives (SSDs). Heap now has approximately 500 TB of data on disk, comprising upwards of 500 billion events. Citus has also enabled Heap to maintain a mostly relational query model, providing all the benefits Heap needs from a traditional Postgres database without the single-node restriction, while also facilitating lightning-fast analytics.
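The idea behind horizontal scale-out can be sketched in a few lines of Python: each row is routed to a node by hashing a key, so data and query work spread across the cluster. This is a simplified illustration under stated assumptions, not how Citus is implemented. Citus hashes a distribution column into shards that are then placed on nodes; the sketch below maps keys straight to nodes, and the choice of SHA-256, the `node_for` helper, and the `user-N` keys are all hypothetical.

```python
import hashlib
from collections import Counter

NODE_COUNT = 40  # cluster size quoted in the text


def node_for(user_id: str, node_count: int = NODE_COUNT) -> int:
    """Map a key to a node index with a stable hash.

    SHA-256 is an illustrative choice, not the hash function Citus uses.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


# Route 100,000 hypothetical user keys and count how many land on each node.
placement = Counter(node_for(f"user-{i}") for i in range(100_000))
```

With a well-mixed hash, the keys spread roughly evenly across the 40 nodes. Taking the figures in the text at face value, the averages work out to about 0.25 TB of memory and at least 12.5 billion events per node.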
A positive collaboration with the team at Citus Data
In addition to using the core Citus database, Heap worked with the Citus team to have custom extensions written related to funnel computations and behavior and retention analysis. “From the start, Citus has been extremely responsive to our needs,” said Robinson. “Our custom extensions have unlocked very powerful new features for our customers. It may have been possible to do some of the same analysis without these extensions, but it would definitely have been a lot slower.”
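The text does not describe how the custom extensions work, so the following is only a generic Python sketch of what a funnel computation involves: counting how many users reach each step of an ordered sequence. The `funnel_counts` helper, the event names, and the integer timestamps are hypothetical.

```python
from collections import defaultdict


def funnel_counts(events, steps):
    """Count users reaching each funnel step, requiring steps in order.

    events: iterable of (user_id, event_name, timestamp)
    steps:  ordered list of event names that define the funnel
    """
    progress = defaultdict(int)  # user_id -> number of steps completed
    for user_id, name, _ts in sorted(events, key=lambda e: e[2]):
        done = progress[user_id]
        # Advance only when the event matches the user's next expected step.
        if done < len(steps) and name == steps[done]:
            progress[user_id] = done + 1
    counts = [0] * len(steps)
    for done in progress.values():
        for i in range(done):
            counts[i] += 1
    return counts


# Hypothetical sample: "a" completes the funnel, "b" skips a step,
# and "c" adds to cart before viewing, so only the later add counts.
steps = ["view_page", "add_to_cart", "purchase"]
events = [
    ("a", "view_page", 1), ("a", "add_to_cart", 2), ("a", "purchase", 3),
    ("b", "view_page", 1), ("b", "purchase", 2),
    ("c", "add_to_cart", 1), ("c", "view_page", 2), ("c", "add_to_cart", 3),
]
counts = funnel_counts(events, steps)
```

A pure-Python pass like this is far too slow for hundreds of billions of events. Running such logic inside the database, close to the data, is the kind of speedup Robinson attributes to the custom extensions.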
Citus gives Heap the indexing and querying power of PostgreSQL—at scale
Because Citus is an extension to PostgreSQL, Heap was able to leverage its extensive knowledge of PostgreSQL and the Postgres ecosystem. Citus also enabled Heap to use a data model that would support some of the more advanced analytics that the company wanted to make available to customers. “Being on a fully relational data model with the full indexing and querying power of Postgres was extremely valuable when it came to adding new analytics capabilities. Yet Citus also enabled us to maintain very fast response times, even with extremely complex queries,” said Robinson.
With analytics infrastructure from Heap, you can auto-track the entire customer journey. Heap automatically captures every customer touchpoint: every click, tap, swipe, form change, and more. No more tracking plans, tracking code, or tags. Heap makes sure you don’t miss out on unknown unknowns, so you can get answers in seconds and make decisions faster.