Earlier this week as I was waiting to begin a talk at a conference, I chatted with someone in the audience that had a few questions. They led off with this question: is Citus a good fit for X? The heart of what they were looking to figure out: is the Citus distributed database a better fit for analytical (data warehousing) workloads, or for more transactional workloads, to power applications? We hear this question quite a lot, so I thought I’d elaborate more on the use cases that make sense for Citus from a technical perspective.
Before I dig in, if you’re not familiar with Citus; we transform Postgres into a distributed database that allows you to scale your Postgres database horizontally. Under the covers, your data is sharded across multiple nodes, meanwhile things still appear as a single node to your application. By appearing still like a single node database, your application doesn’t need to know about the sharding. We do this as a pure extension to Postgres, which means you get all the power and flexibility that’s included within Postgres such as JSONB, PostGIS, rich indexing, and more.
OLAP - Data warehousing as it was 5 years ago
Once upon a time (when Citus Data was first started ~7 years ago), we focused on building a fast database to power analytics. Analytical workloads often had different needs and requirements around them. Transactions weren’t necessarily needed. Data was often loaded in bulk as opposed to single row inserts and updates. Analytics workloads have evolved and moved from pure-OLAP to a mix of OLTP and OLAP, and so Citus has too.
Data warehousing for slower storage and massive exploration
Going down the data warehousing rabbit hole, you’ll find use cases for storing hundreds of terabytes of data. This can range from historical audit logs, analytics data, to event systems. Much of the data is seldomly accessed, but one day you may need it so you want to retain it. This data is typically used internally by a data analyst that has some mix of defined reports (maybe they run them monthly) as well as ad-hoc exploration. For their reporting needs these can often be run in batch and aren’t interactive or user facing.
Data warehouses can be single node or multiple node, but almost always are a massively parallel processing (MPP) system. Often times they have columnar storage (which we’ll get into in a future post) that for their method of bulk ingestion and minimizing storage cost can work quite well. The design of most data warehouses though leads them to not be ideal for powering anything user facing as they typically have fairly low concurrency.
OLTP - Your transactional system of record
On the other side of the coin is a traditional transactional database. A transactional database is what you typically use when you start a Rails application and connect your app to a database for storage. Postgres falls squarely into this category. Side note: at times, Postgres has been forked and modified to meet data warehousing use cases, due to its favorable licensing and solid code base.
The standard CRUD (create, read, update, delete) operations that users perform within your application are the textbook definition of OLTP (online transaction processing). There are certain expectations that come with OTLP such as transactional guarantees for your data and an ability to abort/rollback transactions.
It isn’t a requirement for an OLTP database that it speaks SQL, though most transactional databases do support at least some version of the SQL standard and thus tend to play well with most application frameworks.
The rise of HTAP workloads, both transactional and analytical
Ten years ago, the world was black and white, you either had a database that was OLTP or OLAP. With the rise of software, there is more data than ever before. With the increase in data, many teams see an increase in the desire to provide value and analytics directly on top of transactional data in real-time. This new middle ground is often referred to as HTAP coined by Gartner. Per their definition:
Hybrid transaction/analytical processing (HTAP) is an emerging application architecture that “breaks the wall” between transaction processing and analytics. It enables more informed and “in business real time” decision making.
Real-time analytics, or HTAP, you’ll often find powering customer-facing dashboards, monitoring network data for security alerting, in high-frequency trading environments.
Okay so where does the Citus database fit in? Is it OLAP, OLTP, or HTAP?
When we first started building Citus, we began on the OLAP side. As time has gone on Citus has evolved to have full transactional support, first when targeting a single shard, and now fully distributed across your Citus database cluster.
Today, we find most who use the Citus database do so for either:
- (OLTP) Fully transactional database powering their system of record or system of engagement (often multi-tenant)
- (HTAP) For providing real-time insights directly to internal or external users across large amounts of data.
Have questions about whether Citus can help with your use case? Just let us know and we’d be happy to talk and explore if your use case would be a good fit or not.