Nearly 18 months ago, we open sourced our Citus distributed database and “unforked it” from PostgreSQL by refactoring Citus into a PostgreSQL extension. Seasoned PostgreSQL users likely already know of and use popular PostgreSQL extensions, such as hstore, PostGIS, and pg_stat_statements; however, we realized some of you might appreciate a recap of our journey from fork to extension and beyond.
Before Becoming a PostgreSQL Extension
Prior to the release of Citus 5.0 in the spring of 2016, our codebase was best described as a fork of PostgreSQL. The open source license used by PostgreSQL is quite liberal and generously allows for forks, which has led to a long history of systems being built on top of PostgreSQL, including ParAccel, Truviso, Aster Data, and Greenplum. PostgreSQL releases a new major version each year. These releases are generally backwards compatible from an end-user perspective, but can come with substantial changes in the lower-level PostgreSQL code.
Being a fork of PostgreSQL means needing to adapt to those low-level changes by rebasing atop each new version after it is released, a complex integration process that can easily consume weeks.
Transforming Citus from a Fork to an Extension of Postgres
Fortunately, PostgreSQL exposes many internal hooks that let extensions expand its capabilities and power. Extensions that need the low-level hooks are written in C; simpler ones can be written in SQL or higher-level procedural languages. The hooks give direct access to the core of PostgreSQL: scans, utility commands, planning, and execution are just a few of the processes that can be modified or entirely overridden by an extension. Extensions can provide new datatypes, better monitoring, foreign data wrappers, advanced security capabilities, and even entirely new languages for writing stored procedures!
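To make the hook idea concrete, here is a minimal sketch of the "save the previous hook, install your own, fall through when you don't care" pattern that hook-based extensions commonly follow. It is a standalone model and not real PostgreSQL code: every `demo_` name is hypothetical, the signatures are deliberately simplified, and the string "plans" merely stand in for real plan trees. In an actual extension the hook variable would be one the server exposes (for example `planner_hook`) and the install step would run in the extension's `_PG_init` function.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a server hook type: takes a query, returns a "plan". */
typedef const char *(*demo_planner_hook_type)(const char *query);

/* Hypothetical stand-in for the global hook variable; NULL means no hook installed. */
demo_planner_hook_type demo_planner_hook = NULL;

/* Hypothetical stand-in for the server's built-in planner. */
const char *demo_standard_planner(const char *query)
{
    (void) query;
    return "local plan";
}

/* The "server" entry point: call the hook if one is set, else the default. */
const char *demo_plan_query(const char *query)
{
    if (demo_planner_hook != NULL)
        return demo_planner_hook(query);
    return demo_standard_planner(query);
}

/* ---- hypothetical extension code below ---- */

/* Whatever hook was installed before ours, so other extensions keep working. */
demo_planner_hook_type demo_prev_planner_hook = NULL;

const char *demo_extension_planner(const char *query)
{
    /* Take over only the queries we care about (assumed marker: "dist_table"). */
    if (strstr(query, "dist_table") != NULL)
        return "distributed plan";

    /* Otherwise defer to the previous hook, or to the built-in planner. */
    if (demo_prev_planner_hook != NULL)
        return demo_prev_planner_hook(query);
    return demo_standard_planner(query);
}

/* Plays the role of an extension's load-time initialization function. */
void demo_extension_init(void)
{
    demo_prev_planner_hook = demo_planner_hook;
    demo_planner_hook = demo_extension_planner;
}
```

Before `demo_extension_init()` runs, every query gets the built-in plan; afterwards, only queries mentioning `dist_table` are diverted, and everything else still reaches the default planner. Chaining to the previous hook is what lets several extensions coexist in one server.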
Our Citus database uses most of the low-level PostgreSQL hooks available to us: we create custom nodes, use them to build custom scans, and have custom planner and executor lifecycles to actually carry out your distributed queries. We even override processing of DDL commands to help perform schema modifications on remote nodes and have recently added a background worker within our extension, which performs distributed deadlock detection on in-flight queries. That we can do all of this in a modular “add-on” fashion is a testament to the careful design of PostgreSQL’s internals.
Benefits of Being an Extension to PostgreSQL
Because we no longer maintain an entire fork of the complete PostgreSQL codebase, the effort required to remain compatible with new versions of PostgreSQL has been dramatically reduced: after a new PostgreSQL version is released, we need only integrate with changes to the interfaces we use to call out to PostgreSQL. These kinds of changes are often more along the lines of “a new parameter has been added to a method, fix the call sites” than “an entire codebase of millions of lines has changed beneath you.”
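As an illustration of what "fix the call sites" can look like, the sketch below uses a version-conditional wrapper macro, one common way for a C extension to absorb a changed function signature. It is an assumption about the general technique, not a description of Citus's own code. `PG_VERSION_NUM` is the version macro that PostgreSQL's headers provide (100000 corresponds to PostgreSQL 10); it is defined locally here only so the sketch compiles on its own. The `demo_` function and its extra parameter are hypothetical.

```c
/* In a real extension this comes from PostgreSQL's headers; defined here
 * only so the sketch is standalone. 100000 corresponds to PostgreSQL 10. */
#ifndef PG_VERSION_NUM
#define PG_VERSION_NUM 100000
#endif

#if PG_VERSION_NUM >= 100000
/* Hypothetical newer server API: gained a second parameter. */
int demo_server_estimate_rows(int table_id, int include_partitions)
{
    return table_id * 100 + (include_partitions ? 50 : 0);
}
/* Wrapper hides the new parameter from the extension's call sites. */
#define demo_estimate_rows_compat(table_id) \
    demo_server_estimate_rows((table_id), 0)
#else
/* Hypothetical older server API: single parameter. */
int demo_server_estimate_rows(int table_id)
{
    return table_id * 100;
}
#define demo_estimate_rows_compat(table_id) \
    demo_server_estimate_rows(table_id)
#endif

/* Extension code calls only the wrapper, so it builds against both versions. */
int demo_extension_cost(int table_id)
{
    return demo_estimate_rows_compat(table_id) * 2;
}
```

With the difference confined to one macro, supporting a new server version means touching a handful of wrappers rather than rebasing an entire codebase.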
With this reduction in integration effort, we have recently been able to begin supporting major PostgreSQL releases before they land. This even includes support for entirely new features in upcoming releases: for instance, Citus 7 includes some awareness of PostgreSQL 10’s declarative partitioning syntax, even though Citus 7 shipped a full month before PostgreSQL 10 was released. In the past, when Citus was a fork of PostgreSQL, supporting a new PostgreSQL feature like that would have taken months of integration work.
Today: Citus + PostgreSQL = ❤️
All of this history leads us to where we are today at Citus Data. We continue to push forward with new Citus releases chock-full of features to serve all our customers—whether they be open-source, enterprise, or cloud—making it so you don’t have to worry about your database and can get back to building features.
Citus 7, released in September, was our first release to support a new major PostgreSQL version on the day that version shipped. As soon as the PostgreSQL 10 packages showed up in PGDG, we released OS packages for Citus that were compatible with the PostgreSQL 10 release. The next step: we immediately began builds against PostgreSQL 11 in our continuous integration environment, the earliest we have ever begun building against a new PostgreSQL version.
Admittedly we’re biased when it comes to PostgreSQL: pretty much everyone on the engineering team at Citus Data is a PostgreSQL fan. Still, we’ve been impressed by the capabilities afforded us as an extension to the “world’s most advanced open source database.” And we look forward to the many more impressive features on the PostgreSQL roadmap ahead.