Heap Powers Web & Mobile Analytics on Massive Amounts of Data with Citus & Postgres
The creators of Heap had a vision: to capture and analyze every single user action—each click, tap, swipe, form submission, or page view—on their customers’ web and mobile applications. So, when questions inevitably arose about user activity, the answers would be just seconds away and Heap’s customers could make business decisions on the spot.
Heap needed a database that could deliver lightning-fast query results—at scale
Heap wanted to track all the data and still deliver lightning-fast query results. Similar solutions tended to track only a handful of activities to avoid compromising performance. As a result, before Citus, whenever customers had questions about untracked activities, an engineer had to start logging the new activity, which added cost and lengthy delays while the data was collected.
The Heap development team knew success was contingent on their ability to deliver performance at scale. The team needed to build the real-time, interactive analytics tool around a horizontally scaled clustered database from the ground up. The database also needed to leverage a flexible relational data model to ensure it could accommodate current and future analytical features.
“Our biggest requirement was the ability to scale horizontally and perform analysis fast for large customers. We also wanted to be able to express a wide variety of analyses. This was in the early days of our product and we had some analytical features we wanted to include,” says Dan Robinson, CTO at Heap. “But we also wanted to make sure it would be easy to add new ones because we knew we were going to be adding a lot over the next couple years. And that would require a flexible relational data model.”
NoSQL databases couldn’t meet Heap’s technical requirements
Heap evaluated a number of modern NoSQL databases and data warehouses, including HBase, Druid, Cassandra, and Redshift. The team ultimately chose the Citus database because of its Postgres lineage and the ability to easily scale out a database cluster of commodity servers.
“Thanks to Citus, we’re powering a product that is really kind of magical for a lot of customers. The speed and performance of our database make it possible for them to immediately perform truly advanced analytics on any user activities our customers want to explore.” Dan Robinson, CTO at Heap
The fact that Citus is built around Postgres and is open source checked a lot of stability boxes for Heap
“This was going to be our system of record—our single source of truth—so it also needed to be a system we trusted from a stability and durability point of view,” says Robinson. “The fact that Citus is built around Postgres checked a lot of those boxes.”
It also helped that Citus is open source. “When you pick a data layer, you’re making a very long-term investment,” says Robinson. “If the data system is open source, like Citus is, then you can be confident you could support the tool in the future if absolutely necessary.”
Heap runs insanely complex queries on large data sets in <5 seconds with Citus
Thanks to Citus, Heap automatically captures all data from over 6,000 customers. And all of this information is immediately available to Heap’s customers to query for deep insights.
With the Citus database, Heap can offer their customers interactive queries. “You can run a really complex analysis in Heap on a large data set in five seconds,” says Robinson. “Our median query response time is on the order of 500 milliseconds.”
Under the hood, Citus has enabled Heap to cost-effectively scale its Postgres database horizontally across a 70-node cluster that shares approximately 34 TB of memory. Each commodity server in the Citus distributed database cluster has 64 CPU cores and 15.2 TB solid state drives (SSDs). Heap now has approximately 700 TB of data on disk (after compressing 1.4 PB of data with ZFS) and upwards of 700 billion events. Citus has also enabled Heap to maintain a mostly relational query model. Citus provides all the benefits of a traditional Postgres database without the single-node restriction, and facilitates lightning-fast analytics.
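To put those cluster figures in perspective, here is a rough back-of-the-envelope calculation. It uses only the numbers quoted above and assumes decimal units (1 PB = 1,000 TB) and an even spread of data across nodes, neither of which the story states. The variable names are illustrative, and "upwards of 700 billion events" is treated as a lower bound, so the bytes-per-event figure is an upper estimate.

```python
# Back-of-the-envelope arithmetic from the figures quoted in the story.
# Assumptions (not from the source): decimal units and an even spread
# of memory, data, and events across all nodes.
NODES = 70
TOTAL_MEMORY_TB = 34.0
RAW_DATA_TB = 1400.0   # 1.4 PB before ZFS compression
ON_DISK_TB = 700.0     # after ZFS compression
EVENTS = 700e9         # "upwards of 700 billion" treated as a lower bound

memory_per_node_gb = TOTAL_MEMORY_TB / NODES * 1000
compression_ratio = RAW_DATA_TB / ON_DISK_TB
disk_used_per_node_tb = ON_DISK_TB / NODES
events_per_node = EVENTS / NODES
bytes_per_event_on_disk = ON_DISK_TB * 1e12 / EVENTS

print(f"Memory per node:       ~{memory_per_node_gb:.0f} GB")
print(f"ZFS compression ratio: ~{compression_ratio:.1f}x")
print(f"Data per node:         ~{disk_used_per_node_tb:.0f} TB")
print(f"Events per node:       ~{events_per_node:.1e}")
print(f"On-disk bytes/event:   ~{bytes_per_event_on_disk:.0f} (upper estimate)")
```

Under these assumptions, each node holds roughly 10 TB of compressed data and about 10 billion events, with close to half a terabyte of memory per node.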
“Citus makes it possible to represent your data in creative and flexible ways, and gives you a lot of power in how you index your data set. Because Citus is built on Postgres, we can create endlessly complicated ways of representing and indexing the data set, which has given us a ton of mileage over the last couple years.” Dan Robinson, CTO at Heap
With Citus, Heap can create complicated ways of representing & indexing their data sets
With Citus, Heap is able to offer customers very advanced query capabilities. For example, an analyst can define a cohort as “users who have uploaded a photo and have logged in three times in the last week and accepted a friend request in the last month,” and then filter a conversion funnel for people in that cohort.
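The logic of that cohort definition can be sketched in a few lines of Python over an in-memory list of events. This is only an illustration of the rule itself, not how Heap or Citus evaluates it: the event names, the `(user_id, event_name, timestamp)` tuple layout, and the `cohort_users` function are all hypothetical, "three times" is read as "at least three times", and a month is approximated as 30 days.

```python
from collections import Counter
from datetime import datetime, timedelta


def cohort_users(events, now):
    """Return the users matching the example cohort definition.

    `events` is an iterable of (user_id, event_name, timestamp) tuples.
    Event names and this layout are hypothetical, chosen for illustration.
    """
    week_ago = now - timedelta(days=7)
    month_ago = now - timedelta(days=30)  # assumption: month = 30 days

    uploaded = set()          # uploaded a photo at any time
    accepted = set()          # accepted a friend request in the last month
    recent_logins = Counter() # logins in the last week, per user

    for user_id, name, ts in events:
        if name == "upload_photo":
            uploaded.add(user_id)
        elif name == "login" and ts >= week_ago:
            recent_logins[user_id] += 1
        elif name == "accept_friend_request" and ts >= month_ago:
            accepted.add(user_id)

    # Assumption: "three times" means at least three logins.
    frequent = {u for u, n in recent_logins.items() if n >= 3}
    return uploaded & frequent & accepted


now = datetime(2024, 1, 31)
days = lambda n: now - timedelta(days=n)
events = [
    ("alice", "upload_photo", days(90)),
    ("alice", "login", days(1)),
    ("alice", "login", days(2)),
    ("alice", "login", days(3)),
    ("alice", "accept_friend_request", days(10)),
    ("bob", "upload_photo", days(5)),
    ("bob", "login", days(1)),                     # only one recent login
    ("bob", "accept_friend_request", days(2)),
    ("carol", "upload_photo", days(20)),
    ("carol", "login", days(1)),
    ("carol", "login", days(2)),
    ("carol", "login", days(4)),
    ("carol", "accept_friend_request", days(40)),  # too long ago
]

members = cohort_users(events, now)
print(members)
```

The resulting set of user IDs could then be used to restrict a funnel calculation to members of the cohort.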
“Citus makes it possible to represent your data in creative and flexible ways, and gives you a lot of power in how you index your data set. Because Citus is built on Postgres, we can create endlessly complicated ways of representing and indexing the data set, which has given us a ton of mileage over the last couple years,” says Robinson. “That’s a rare thing as these big data systems go. That’s not something you can do in a column store. Your options with column stores are just so much more limited.”
Robinson added, “Thanks to Citus, we’re powering a product that is really kind of magical for a lot of customers. The speed and performance of our database make it possible for them to immediately perform truly advanced analytics on any user activities our customers want to explore. All the data is already there.”
The database team of Postgres experts—one of the best parts of Citus
In addition to using the Citus database, Heap worked with the Citus database team to have custom extensions written related to funnel computations, and behavior and retention analysis. “From the start, Citus has been extremely responsive to our needs,” says Robinson. “Our custom extensions have unlocked very powerful new features for our customers. It may have been possible to do some of the same analysis without these extensions, but it would definitely have been a lot slower.”
Robinson adds, “Our experience working with the Citus Data team is one of the best parts of working with Citus. The team is smart, deeply technical, and very competent.”
“Our product is a huge and complicated wrapper around a Citus database cluster. With the wrong database, we very likely would’ve failed. With Citus, we are able to rapidly scale our business, and the response from our customers has been fantastic.” Dan Robinson, CTO at Heap
Citus does what other databases can’t
Robinson admits Citus meets a very specific set of requirements, but in doing so, Citus does what other databases can’t.
“If I need high transaction volume and a nearly real-time data set, then that rules out a lot of off-the-shelf data warehouses. Delivering a real-time product with most warehouses would require building a complicated real-time layer on top of it—a lambda architecture—and that's going to be a lot of work,” says Robinson.
Today, Citus is critical to Heap’s operations. All of Heap’s data for all of its customers lives in a Citus database cluster, which also powers all of Heap’s analytics. “Our product is a huge and complicated wrapper around a Citus cluster. With the wrong database, we very likely would’ve failed,” says Robinson. “With Citus, we are able to rapidly scale our business, and the response from our customers has been fantastic.”
With analytics infrastructure from Heap, organizations can auto-track the entire customer journey. Heap automatically captures every web, mobile, and cloud interaction: clicks, submits, transactions, emails, and more. So you can retroactively analyze your data, without writing code.