Citus Unforks From PostgreSQL, Goes Open Source

Written by Ozgun Erdogan
March 24, 2016


When we started working on CitusDB 1.0 four years ago, we envisioned scaling out relational databases. We loved Postgres (and the elephant) and picked it as our underlying database of choice. Our goal was to extend this database to seamlessly shard and replicate your tables, provide high availability in the face of failures, and parallelize your SQL queries across a cluster of machines.

We wanted to make the PostgreSQL elephant magical.

Four years later, CitusDB has been deployed into production across a number of verticals and has received numerous feature improvements with every release. PostgreSQL also became much more extensible in that time, and we learned a lot more about it.

Today, we’re happy to release Citus 5.0, which seamlessly scales out PostgreSQL for real-time workloads. We’re also excited to share two major announcements in conjunction with the 5.0 release!

First, Citus 5.0 now fully uses the PostgreSQL extension APIs. In other words, Citus becomes the first distributed database in the world that doesn't fork the underlying database. This means Citus users can immediately benefit from new features in PostgreSQL, such as semi-structured data types (json, jsonb), UPSERT, and, once 9.6 arrives, no more full table vacuums. Also, users can keep working with their existing Postgres drivers and tools.

Second, Citus is going open source! The project, codebase, and all open issues are now available on GitHub. We realized that when we mentioned Citus and PostgreSQL together to prospective users, they already assumed that Citus was open source. After many conversations with our customers, advisors, and board, we are happy to make Citus available for everyone.

To see how to get started with it, let’s take a hands-on look.

Getting Rolling

You can download the extension here. Once you’ve downloaded it, you can bootstrap your initial Postgres database with the Citus cluster. From here we can begin using Citus:

CREATE EXTENSION citus;
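As a quick sanity check, the standard PostgreSQL catalog can confirm that the extension is registered in the current database. This is a minimal sketch that relies only on the built-in pg_extension catalog; the version string it returns depends on the package you installed.

```sql
-- List the Citus extension and its installed version, if present.
-- pg_extension is a standard PostgreSQL catalog, not something Citus adds.
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'citus';
```

If the query returns no rows, the extension has not been created in this database yet.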

Now that the extension is enabled, you can begin taking advantage of it. First we’ll create a table, then we’ll tell Citus to make it a distributed table, and finally we’ll inform it about our shards. If you’re running Citus on a single machine, this will scale queries across multiple CPU cores and create the impression of sharding across databases.

As an example, which our tutorial covers in more detail, we’re going to create a table to capture edits from Wikipedia, then shard this table across multiple Postgres instances. First let’s create our table:

CREATE TABLE wikipedia_changes (
  editor TEXT, -- The editor who made the change
  time TIMESTAMP WITH TIME ZONE, -- When the edit was made
  bot BOOLEAN, -- Whether the editor is a bot

  wiki TEXT, --  Which wiki was edited
  namespace TEXT, -- Which namespace the page is a part of
  title TEXT, -- The name of the page

  comment TEXT, -- The message they described the change with
  minor BOOLEAN, -- Whether this was a minor edit (self-reported)
  type TEXT, -- "new" if this created the page, "edit" otherwise

  old_length INT, -- How long the page used to be
  new_length INT -- How long the page is as of this edit
);

Now that we’ve created our table, we’re going to tell the Citus extension that this is the one we want to shard. For this demo, we’re going to lower the replication factor to one, since we’re only running one worker node.

SET citus.shard_replication_factor = 1;
SELECT master_create_distributed_table('wikipedia_changes', 'editor', 'hash');
SELECT master_create_worker_shards('wikipedia_changes', 16, 1);
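To see what those two calls produced, you can look at the shard metadata that Citus keeps on the master node. The sketch below assumes the metadata table is named pg_dist_shard and exposes the logicalrelid, shardid, shardminvalue, and shardmaxvalue columns; check the documentation for your Citus version if the names differ.

```sql
-- Assumption: pg_dist_shard is the Citus shard metadata table in this version.
-- Each row describes one shard and the range of hash values it covers.
SELECT shardid, shardminvalue, shardmaxvalue
FROM pg_dist_shard
WHERE logicalrelid = 'wikipedia_changes'::regclass
ORDER BY shardid;
```

With the settings above, you would expect one row for each of the 16 shards requested.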

You can start inserting data with a standard INSERT INTO, and Citus will shard and distribute your data across multiple nodes. If you want a jump start on loading data, check out our tutorial, which includes scripts to help you start loading data automatically from the Wikipedia event stream.
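For illustration, here is a minimal sketch of a single-row insert followed by a count. The row values are made up for this example and do not come from the real Wikipedia event stream; the column names match the wikipedia_changes table defined earlier.

```sql
-- Hypothetical sample row; every value here is invented for illustration.
INSERT INTO wikipedia_changes
  (editor, time, bot, wiki, namespace, title, comment, minor, type, old_length, new_length)
VALUES
  ('example_editor', now(), false, 'enwiki', 'Main', 'PostgreSQL',
   'Fixed a typo', true, 'edit', 1200, 1203);

-- The row is routed to a shard based on the hash of the editor column;
-- a count across the table touches every shard.
SELECT count(*) FROM wikipedia_changes;
```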

It’s that simple to use the fully open source Citus 5.0 extension. Now, let’s take a deeper look at some of the technical details.

What’s Unique about Citus?

Citus uses three new ideas when building the distributed database.

  1. Citus scales out SQL by extending PostgreSQL, not forking it. This way, users benefit from all the performance and feature work done on Postgres over the past two decades, scaled out on a cluster of machines.
  2. Data-intensive applications have evolved over time to require multiple workloads from the database. Citus comes with three distributed executors, recognizing differences across operational (low-latency) and analytic (high-throughput) workloads.
  3. Parallelizing SQL queries requires that the underlying theoretical framework is complete. Citus’ distributed query planner uses multi-relational algebra, which is proven to be complete.

These principles help us lay the foundation for a scalable relational database. With that said, we know that we still have more work ahead of us. PostgreSQL is huge, and Citus currently doesn’t support the full spectrum of SQL queries. For details on SQL coverage, please see our FAQ.
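To make the executor distinction in the list above more concrete, here is a hedged sketch of switching executors per session. It assumes the setting is named citus.task_executor_type and accepts the values 'real-time' and 'task-tracker'; confirm both against the documentation for your Citus version before relying on them.

```sql
-- Assumption: citus.task_executor_type is the setting name in this version.

-- Low-latency queries: favor fast responses for operational lookups.
SET citus.task_executor_type TO 'real-time';
SELECT count(*) FROM wikipedia_changes WHERE bot;

-- Long-running analytic queries: favor throughput over latency.
SET citus.task_executor_type TO 'task-tracker';
SELECT wiki, count(*) AS edits
FROM wikipedia_changes
GROUP BY wiki;
```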

A good way to get started with Citus today is to think of it in terms of your use-case.

Common Use Cases

Citus provides users real-time responsiveness over large datasets, most commonly seen in rapidly growing event systems or with time series data. Common uses include powering real-time analytic dashboards, exploratory queries on events as they happen, session analytics, and large data set archival and reporting.
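As a rough sketch of the kind of query that sits behind a real-time dashboard, the example below counts edits per minute over the last hour, using the wikipedia_changes table from the walkthrough above. The one-hour window and the output column names are arbitrary choices for illustration.

```sql
-- Edits per minute over the last hour, split out by bot versus human editors.
SELECT date_trunc('minute', time) AS edit_minute,
       bot,
       count(*) AS edits
FROM wikipedia_changes
WHERE time > now() - interval '1 hour'
GROUP BY 1, 2
ORDER BY 1, 2;
```

Because the aggregation runs on each shard in parallel before the results are combined, queries of this shape are the ones that benefit most from scaling out.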

Citus is deployed in production across multiple verticals, ranging from technology start-ups to large enterprises. Here are some examples:

  • Cloudflare uses Citus to provide real-time analytics on 100 TBs of data from over 4 million customer websites.
  • Neustar builds and maintains a scalable ad-tech infrastructure that analyzes billions of events per day using HyperLogLog and Citus.
  • Agari uses Citus to secure more than 85 percent of U.S. consumer emails on two 6-8 TB clusters.
  • Heap uses Citus to run dynamic funnel, segmentation, and cohort queries across billions of users and tens of billions of events.

As excited as we are to make Citus 5.0 available to everyone, we’d be remiss to not pay attention to those of you who need something more. For customers with large production deployments, we also offer an enterprise edition that comes with additional functionality and commercial support.

In Conclusion

We’re excited to release the latest version of Citus and make it open source. And we’d love to hear your feedback. If you have questions or comments for us, start a thread in our Google Group, join us through the Citus Slack, or open an issue on GitHub.

Ozgun Erdogan

Co-founder & CTO of Citus Data. Former Postgres engineering team director at Microsoft. Worked on distributed systems at Amazon. Speaker at PGCon, XLDB Conf, DataEngConf, PostgresOpen, & QCon. Dad.