POSETTE 2024 is a wrap! 💯 Thanks for joining the fun! Missed it? Watch all 42 talks online 🍿
POSETTE 2024 is a wrap! 💯 Thanks for joining the fun! Missed it? Watch all 42 talks online 🍿
Written by Craig Kerstiens
June 7, 2018
Earlier this week as I was waiting to begin a talk at a conference, I chatted with someone in the audience that had a few questions. They led off with this question: is Citus a good fit for X? The heart of what they were looking to figure out: is the Citus distributed database a better fit for analytical (data warehousing) workloads, or for more transactional workloads, to power applications? We hear this question quite a lot, so I thought I'd elaborate more on the use cases that make sense for Citus from a technical perspective.
Before I dig in, if you're not familiar with Citus; we transform Postgres into a distributed database that allows you to scale your Postgres database horizontally. Under the covers, your data is sharded across multiple nodes, meanwhile things still appear as a single node to your application. By appearing still like a single node database, your application doesn't need to know about the sharding. We do this as a pure extension to Postgres, which means you get all the power and flexibility that's included within Postgres such as JSONB, PostGIS, rich indexing, and more.
Once upon a time (when Citus Data was first started ~7 years ago), we focused on building a fast database to power analytics. Analytical workloads often had different needs and requirements around them. Transactions weren't necessarily needed. Data was often loaded in bulk as opposed to single row inserts and updates. Analytics workloads have evolved and moved from pure-OLAP to a mix of OLTP and OLAP, and so Citus has too.
Going down the data warehousing rabbit hole, you'll find use cases for storing hundreds of terabytes of data. This can range from historical audit logs, analytics data, to event systems. Much of the data is seldomly accessed, but one day you may need it so you want to retain it. This data is typically used internally by a data analyst that has some mix of defined reports (maybe they run them monthly) as well as ad-hoc exploration. For their reporting needs these can often be run in batch and aren't interactive or user facing.
Data warehouses can be single node or multiple node, but almost always are a massively parallel processing (MPP) system. Often times they have columnar storage (which we'll get into in a future post) that for their method of bulk ingestion and minimizing storage cost can work quite well. The design of most data warehouses though leads them to not be ideal for powering anything user facing as they typically have fairly low concurrency.
On the other side of the coin is a traditional transactional database. A transactional database is what you typically use when you start a Rails application and connect your app to a database for storage. Postgres falls squarely into this category. Side note: at times, Postgres has been forked and modified to meet data warehousing use cases, due to its favorable licensing and solid code base.
The standard CRUD (create, read, update, delete) operations that users perform within your application are the textbook definition of OLTP (online transaction processing). There are certain expectations that come with OTLP such as transactional guarantees for your data and an ability to abort/rollback transactions.
It isn't a requirement for an OLTP database that it speaks SQL, though most transactional databases do support at least some version of the SQL standard and thus tend to play well with most application frameworks.
Ten years ago, the world was black and white, you either had a database that was OLTP or OLAP. With the rise of software, there is more data than ever before. With the increase in data, many teams see an increase in the desire to provide value and analytics directly on top of transactional data in real-time. This new middle ground is often referred to as HTAP coined by Gartner. Per their definition:
Hybrid transaction/analytical processing (HTAP) is an emerging application architecture that "breaks the wall" between transaction processing and analytics. It enables more informed and "in business real time" decision making.
Real-time analytics, or HTAP, you'll often find powering customer-facing dashboards, monitoring network data for security alerting, in high-frequency trading environments.
When we first started building Citus, we began on the OLAP side. As time has gone on Citus has evolved to have full transactional support, first when targeting a single shard, and now fully distributed across your Citus database cluster.
Today, we find most who use the Citus database do so for either:
Have questions about whether Citus can help with your use case? Just let us know and we’d be happy to talk and explore if your use case would be a good fit or not.