Documenting the Citus extension to Postgres: an interview with begriffs

Update in October 2022: Citus has a new home on Azure! The Citus database is now available as a managed service in the cloud as Azure Cosmos DB for PostgreSQL. Azure documentation links have been updated throughout the post, to point to the new Azure docs.

The last two months, I managed the agenda for our weekly Citus team meeting, the one time each week where our entire distributed team—with people spread across 6 different countries—gets together to talk about Citus things. As I chatted with our PostgreSQL folks to find speakers to give 10-minute “lightning talks”, I heard a chorus from several of the engineers: “see if you can get Joe to give a talk. His talks are always super interesting.”

I succeeded. Joe Nelson (known as begriffs online) did deliver a talk titled “Dominus SQL, lord of my domain.” And the engineers liked it. Not a surprise, as Joe’s content tends to be pretty popular, both on his personal blog, and on the Citus Data blog, including high traffic posts such as 5 ways to paginate in Postgres and Faster PostgreSQL Counting.

And when Joe agreed to let me interview him about his work on the Citus documentation (he’s quite busy so I wasn’t sure he would say yes), well, I was thrilled. This post is an edited transcript of my interview with Joe—and it’s your inside baseball view into how the documentation for the Citus open source project gets made.

Documentation = an essential way to share aspects of open source

Claire: You’ve been working on the Citus documentation since you joined Citus back in 2016. Why do you do this type of work? Why documentation?

Joe: I think I’m hard-wired to care about documentation. When I work on any kind of project, I want my work to be enjoyed publicly. I know some people might like to sit and carve themselves a rocking chair that only they are going to rock in. And creating that rocking chair gives them a sense of calm satisfaction.

But for me, I'm always thinking that my work is going to be way more fun if other people get to try it and use it. I want to share things. And documentation has always been an essential way for me to share aspects of the open source projects I work on. So other people can know what it's all about—and can jump into the project with me.

I’ve always had this inclination. When I was a kid, I read a book called C for Dummies, and after I finished, I decided I would write a tutorial in a similar style. I wrote a full language guide about QBASIC in ye olde DOS text editor. Probably everything about that project (language, style, and editor) was regrettable, but 13-year-old me thought it was a lot of fun.

Citus docs draw the reader into a world where Postgres is distributed

Claire: Can you give me an overview of the Citus documentation? What is it about?

Joe: The Citus docs start from the assumption that our reader is a database user. So, if you’re reading the Citus documentation, maybe you’re an analyst and you’re running reports—or maybe you’re an application developer. But almost certainly, you’re using a database.

With Citus, we enable PostgreSQL users to take advantage of multiple machines in order to process and safeguard their data, which is important for people whose database has grown too big for a single server to handle economically—or at all.

It’s true that these days you can deploy some pretty large single-server instances in the cloud, but doing so can be expensive, and sometimes even the largest instance is not large enough.

Which means some PostgreSQL users and application developers are hitting problems of scale and growth. Our goal with Citus is to help you scale in a way that allows you to grow your database workloads—while still being able to use relational database features such as constraints and foreign keys and joins.

If you’re reading the Citus docs, you’re not just a database user, you’re almost certainly a PostgreSQL user. And, you probably want to continue to take advantage of relational database semantics—so you don’t have to re-architect your application or throw in the towel with a NoSQL kind of thing, in order to address your challenges with growth.

With the Citus documentation, I want to draw the reader into the world where database tables are distributed across multiple machines.

Relationship between Citus & Postgres docs

Claire: When someone is looking at the Citus docs, do you assume they have already read the Postgres docs?

Joe: Because Citus is an extension to Postgres—an extension that transforms Postgres into a distributed database—a guiding principle is that we don’t want to duplicate anything in the Postgres documentation.

Our Citus docs are full of links to the Postgres docs as needed. We don’t duplicate. Instead we focus on documenting how PostgreSQL features work in the context of a distributed database like Citus. And, of course, we want the Citus docs to show you that by modeling your data right and distributing your data across a cluster in the right way, you can make your application go fast.

Like a jigsaw puzzle

Joe: I think of Citus like a jigsaw puzzle. Imagine you see a puzzle all put-together right in front of you. You can see the picture the puzzle forms, right? Now imagine that the different puzzle pieces are physically stored in different boxes. Your data is distributed across different physical locations—but you (or the database) can still see the whole picture. The Citus extension can still see the whole puzzle picture, even though the different puzzle pieces are physically located on different machines.

Tutorials & use case guides are the most fun parts to write

Claire: Do you have favorites? I mean, of all the different parts of the Citus documentation—is there a part you like best?

Joe: There's definitely a favorite part for me to write. I don't know if I have a favorite in terms of “this is the best part of the Citus documentation,” but the most fun for me to write are pieces that are more hands-on, like our tutorials and use case guides.

Anything where we start from nothing and you get to create something. Walking people through how to put the pieces together and create things is way more fun than writing reference docs.

The Citus docs start with quick tutorials about distributing data and running PostgreSQL queries in the context of a database cluster with multiple machines. I want people to be able to accomplish something and have success right away, within just a few minutes.

The first tutorial I ever created for Citus—based on an example created by a Citus developer named Brian Cloutier—deals with a made-up advertising application that runs an ad campaign for multiple companies. Multi-tenant use cases like that one are fairly common these days. In fact, most SaaS applications are “multi-tenant”, since a single cloud application serves many different customers (also known as ‘tenants’). The idea behind this use-case is that queries for each tenant use their own resources on separate machines and can all happen in parallel.
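Editor's note: to make the multi-tenant idea concrete, here is a minimal Python sketch of how rows might be grouped by tenant so that each tenant's data lives on one machine. It is an illustration only, not Citus code. The worker names, the shard count, the `company_id` column, and the hash function are all assumptions chosen for this example; Citus uses its own hashing and shard placement.

```python
import hashlib
from collections import defaultdict

WORKERS = ["worker-1", "worker-2", "worker-3"]  # hypothetical node names
SHARD_COUNT = 8  # hypothetical; chosen only for this illustration


def shard_for(tenant_id: int) -> int:
    """Map a tenant id to a shard with a stable hash (not Citus's real hash)."""
    digest = hashlib.sha256(str(tenant_id).encode()).digest()
    return int.from_bytes(digest[:4], "big") % SHARD_COUNT


def worker_for(shard: int) -> str:
    """Assign shards to workers round-robin."""
    return WORKERS[shard % len(WORKERS)]


def distribute(rows):
    """Group rows by the worker that owns each row's tenant."""
    placement = defaultdict(list)
    for row in rows:
        placement[worker_for(shard_for(row["company_id"]))].append(row)
    return dict(placement)


# Made-up campaign rows; company_id plays the role of the tenant id.
campaigns = [
    {"company_id": 1, "name": "spring sale"},
    {"company_id": 1, "name": "summer sale"},
    {"company_id": 2, "name": "launch"},
    {"company_id": 3, "name": "rebrand"},
]
placement = distribute(campaigns)
```

Because the hash is stable, every row for a given company lands on the same worker, so a query that filters on one tenant only needs to touch one machine.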

The second quick tutorial focuses on the real-time analytics use case. Real-time analytics is a common scenario that PostgreSQL users employ with Citus, because it can have demanding performance and concurrency requirements. With real-time analytics applications, you need a database that can ingest large amounts of data in near real-time, while at the same time deliver low-latency query responses to hundreds or thousands of end users.

In the real-time analytics tutorial, we use sample data from GitHub, a data set which is basically a big dump of events such as issues created, PRs opened, and commits pushed. In the tutorial, you get to analyze these GitHub events to compute statistics on that information really fast. For example, you can calculate the number of commits per minute or the top users that are creating repositories.
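Editor's note: the "commits per minute" calculation can be pictured as each machine aggregating its own slice of the events, followed by a cheap merge of the partial results. The Python sketch below illustrates that pattern under stated assumptions: the event fields (`type`, `created_at`, `commits`) and the sample rows are made up for this example and are not the actual schema or data from the tutorial.

```python
from collections import Counter
from datetime import datetime

# Hypothetical shards: each worker holds a slice of the events table.
shard_a = [
    {"type": "PushEvent", "created_at": "2016-12-01T05:10:12", "commits": 2},
    {"type": "IssuesEvent", "created_at": "2016-12-01T05:10:40", "commits": 0},
]
shard_b = [
    {"type": "PushEvent", "created_at": "2016-12-01T05:10:59", "commits": 3},
    {"type": "PushEvent", "created_at": "2016-12-01T05:11:05", "commits": 1},
]


def commits_per_minute(events):
    """Partial aggregate computed on one shard: commits grouped by minute."""
    counts = Counter()
    for event in events:
        if event["type"] != "PushEvent":
            continue  # only push events carry commits in this toy data
        minute = datetime.fromisoformat(event["created_at"]).replace(second=0)
        counts[minute.isoformat()] += event["commits"]
    return counts


def merge(partials):
    """Combine the per-shard partial results into one final answer."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return dict(sorted(total.items()))


result = merge(commits_per_minute(shard) for shard in (shard_a, shard_b))
```

The per-shard step is where the heavy lifting happens, and it can run on every machine at the same time; the merge only has to handle one small row per minute per shard.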

The Citus tutorials are fun and quick. And to my mind, tutorials are one of the best ways for people to learn how Citus transforms Postgres into a distributed database. Oh, and we also expanded the tutorials into what I call use case guides—I’ve created a use case guide for multi-tenant applications, one for real-time analytics dashboards, and another one for timeseries data. These use case guides dig deeper into features, challenges, and the implications of coordinating between multiple machines, which is intrinsic to a distributed database like Citus.

Get started by downloading open source or by provisioning Hyperscale (Citus) on Azure

Claire: To get started with the Citus hands-on tutorials, do people need to download the Citus extension first?

Joe: Yes and no. Citus is open source and so sure, an obvious place to start for many developers will be to download the Citus packages and install.

But Citus is also available in the cloud. As of last year (2019), Citus is now built into Azure Database for PostgreSQL, which is the fully managed Postgres service running on Microsoft Azure. So some users will jump straight to provisioning a Hyperscale (Citus) server group on Azure Database for PostgreSQL.

If you choose to download Citus open source packages, the Citus installation instructions are divided into two different types: a single-machine cluster and a multi-machine cluster.

The single-machine cluster is useful if you want to get started quickly with Citus, and emulate a database cluster on a development machine. You won’t see a performance advantage with a single-machine cluster, and it is definitely not intended for production use, but it’s a useful way to try Citus. For the single-machine clusters, I’ve written instructions for Docker, Ubuntu, Debian, Fedora, CentOS, and Red Hat.

I’ve also created separate installation instructions for multi-machine clusters, where you will be running separate physical machines and connecting them together. Just like in the single-machine cluster, I have written install instructions for different operating systems like Ubuntu, Debian, Fedora, CentOS, and Red Hat.

When you download and install Citus, the Postgres packages are included as part of the Citus installation. This is because when you run Citus, you’re running a standard Postgres server, with the added functionality of the Citus open source extension on top.

INSERT..SELECT with re-partitioning in Citus 9.2

Claire: I know Marco Slot just blogged about Citus 9.2, the most recent Citus open source release. Is there something you worked on in 9.2 that you want to shine a light on?

Joe: Yeah, the engineering team that works on Citus open source made quite a number of improvements in 9.2.

The feature I think is coolest is called INSERT..SELECT with re-partitioning. The PostgreSQL INSERT..SELECT feature allows you to append the results of a query into another table. This is useful for transforming data or pre-aggregating data in interesting ways. Rather than having to insert one data point after another yourself, you can just take everything that a query outputs and put it into another table.

Even before 9.2, Citus already enabled you to do a distributed INSERT..SELECT. The thing that is new in Citus 9.2 is a performance optimization. Under the covers, the Citus worker nodes (also called data nodes) can now shuffle data amongst themselves in more situations, rather than having to collect the query results back up to the coordinator node. So you are no longer bound to a single distribution column: now you can shuffle the data all throughout the nodes, which opens up new possibilities for ingestion and processing. This new feature also means your INSERT..SELECT queries can be up to 5x faster, especially for real-time analytics pipelines.
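Editor's note: the shuffle between worker nodes can be illustrated with a toy model in Python. This is a sketch of the general re-partitioning idea, not the Citus implementation. The cluster size, the modulo placement rule, and the `user_id` and `repo_id` columns are assumptions made up for this example.

```python
from collections import defaultdict

WORKER_COUNT = 3  # hypothetical cluster size


def owner(key: int) -> int:
    """Toy placement rule: the worker that owns a distribution-column value."""
    return key % WORKER_COUNT


def place(rows, column):
    """Distribute rows across workers by the given distribution column."""
    workers = defaultdict(list)
    for row in rows:
        workers[owner(row[column])].append(row)
    return workers


def repartition(source_workers, new_column):
    """Each source worker routes its rows straight to the target owner.

    Returns the target placement and the number of rows that had to
    travel to a different worker.
    """
    target = defaultdict(list)
    moved = 0
    for worker_id, rows in source_workers.items():
        for row in rows:
            destination = owner(row[new_column])
            target[destination].append(row)
            if destination != worker_id:
                moved += 1
    return target, moved


# Source table distributed by user_id; destination distributed by repo_id.
pairs = [(1, 7), (2, 7), (3, 9), (4, 5), (5, 3)]
events = [{"user_id": u, "repo_id": r} for u, r in pairs]
by_user = place(events, "user_id")
by_repo, moved = repartition(by_user, "repo_id")
```

In this model only the rows whose new owner differs from their current worker travel at all, and they go directly from worker to worker. If everything were pulled through a single coordinator instead, every row would make two hops: up to the coordinator and back out to a worker.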

Open source publishing workflow

Claire: I promised I would ask you about the platform you use to develop the Citus documentation. Because I think other open source projects & startups would be interested.

Joe: When we first started at Citus, we did some experimentation about how our publishing process should work for the docs. What we settled on eventually was an open source platform called Read the Docs.

We use Read the Docs for 3 primary purposes. First, they host the web server for our Citus documentation. Second, they index our content for searches on the docs pages, to improve the results when someone uses the search box. I like that we didn’t have to build the search capability ourselves. Third, Read the Docs allows us to provide multiple versions of the docs that match the different versions of the Citus software.

The versioning capability is important for continuity. Because we still have older versions of Citus out there, you might still be using these older versions, which means you probably also need access to older versions of the docs. As I write corrections, Read the Docs makes it easy for me to apply changes to more than one version. So that's pretty cool, that versioning is something I don't have to manage myself. There are other do-it-yourself, self-hosted systems, but this was pretty handy.

Internally, Read the Docs uses an open source documentation generator called Sphinx, which means that when I write documents, I write them using a markup syntax called reStructuredText. reStructuredText is similar to Markdown, but it has more support for maintaining the consistency of internal links.

I should also mention: when I collaborate with Citus developers and with members of our open source community, we collaborate on the citus_docs repo on GitHub. Developers who are working on Citus will create an issue for me, describing the user-facing implications, the use case, and the technical details. I make a pull request in GitHub, and the developers review it.

In creating the publishing process for Citus docs, we also created some infrastructure for ourselves, specifically a pipeline between Heroku and GitHub. Any time someone opens a pull request on GitHub, Heroku will automatically build and stage a temporary version of the docs website. These Heroku apps help my reviewers see doc changes as they would look when published, rather than just reviewing a diff. That's been a really smooth part of the process.

Open source docs make it easier to collaborate

Claire: Are the Citus docs also open source?

Joe: Yeah, the Citus docs are public, and open source. In fact, not only are the finished docs open source, but the discussions about them in GitHub issues are also public. Sometimes I choose not to put every detail from the issues into the docs, but even so, the issues can be a good source for curious people to learn more. The docs are open source in all senses—our contributions are public, and we accept contributions from other people. Someone just sent a GitHub pull request to improve the Citus FAQ, for instance.

Documenting Citus on docs.microsoft.com

Claire: Do you also work on docs.microsoft.com, on the documentation for Hyperscale (Citus) on Azure Database for PostgreSQL?

Joe: Yes I do. The Microsoft docs platform is fairly well evolved. Think about how long Azure and other Microsoft products have been in existence, and how important good documentation is to developers… it’s no surprise that the team at Microsoft has put a lot of resources into documentation. It's pretty cool. There's an automatic grammar checking feature. And pages on Microsoft docs are assigned a content quality score you have to reach—oh, and translations are important on docs.microsoft.com, too. Both machine and human translators are used in order to ensure that the Microsoft docs are fully internationalized.

Microsoft Docs is chock full of useful features that help people like me to create even better documentation. One example of a feature we’ve wanted to add to our Citus open source docs for a while, but just haven’t prioritized, is the ability to give feedback on every page. On docs.microsoft.com, you can express how you feel about the content on each page by giving a thumbs-up or thumbs-down. At the bottom of the page, you can also give feedback, either on the product itself, or on the page—in place or on GitHub.

Back to your original question: the short answer is yes, I also write our Hyperscale (Citus) documentation for Azure Database for PostgreSQL. The good news in terms of being able to leverage knowledge is that the topics in our Citus open source docs apply to Azure too, because Citus is the open source engine inside of Hyperscale (Citus). If you already know how to use the Citus extension to Postgres, you will know the fundamentals of using Hyperscale (Citus) to scale out Azure Database for PostgreSQL.

Future ideas: gamifying the Citus docs

Claire: Ok, last question. Do you have any interesting documentation projects up your sleeve that you can share with us, to give us a peek into what’s coming?

Joe: Yes. I have all sorts of future projects in mind, but I haven't yet begun any actual work on any of them. But I definitely have ideas of things that would be cool for the Citus documentation.

The first idea is something I've always wanted to do, although I’d be equally happy if someone else created it for me. The idea: I’d like to create a sort of game, a distributed PostgreSQL game, where there are levels that you have to beat, and each level gets progressively harder.

In this Citus game you would not be competing against other people, rather you’d be competing against yourself, trying to learn new skills and eke out more performance.

When I think about how the game might work… say you're given a scenario. Perhaps you're told to imagine that you run the GitHub website and you want to count something special, like how many distinct users interacted with the site during a certain hour? This level would have some initial configuration. We’d give you the data in a database, you've got the tables, you’re at the console, and you’ll need to run SQL queries in order to answer the questions we give you. The challenge will be to find queries that are efficient, that deliver low-latency results. We would give you clues to help you find the right steps in the Citus documentation.

With some research, you would be able to beat this level in the game. Perhaps the only way to beat a particular level might be to employ a sketch algorithm, like HyperLogLog, that allows you to do something that is typically quite expensive in terms of resource usage—like a COUNT DISTINCT—with extreme efficiency, even in the face of limited amounts of memory. (Editor’s note: HyperLogLog is itself a Postgres extension and is also open source—and it’s maintained by the Postgres team here at Microsoft.)
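Editor's note: for readers curious about why a sketch algorithm is so cheap, here is a toy HyperLogLog in Python. It is a simplified illustration of the general technique, not the code of the HyperLogLog Postgres extension. The precision value, the use of SHA-256 as the hash, and the made-up user ids are assumptions for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog sketch (illustration only)."""

    def __init__(self, precision: int = 12):
        self.p = precision
        self.m = 1 << precision       # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        index = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of the first 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other: "HyperLogLog") -> None:
        """Union of two sketches: keep the larger value in each register."""
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)  # small-range correction
        return estimate


# Two hypothetical workers each sketch the users they saw; the ranges overlap.
worker_1, worker_2 = HyperLogLog(), HyperLogLog()
for user in range(0, 6000):
    worker_1.add(f"user-{user}")
for user in range(4000, 10000):
    worker_2.add(f"user-{user}")

worker_1.merge(worker_2)  # now approximates the union: 10,000 distinct users
approx_distinct = worker_1.count()
```

Each sketch holds a fixed 4,096 small registers no matter how many users it has seen, and two sketches merge by taking a per-register maximum. That is what makes an approximate distinct count practical across machines: each node ships a small sketch instead of its full list of users.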

Maybe there would be a leaderboard about who got the highest score, maybe not. No matter what, I think it would be super fun to create a Citus documentation game—and hopefully it would prove to be a fun way for new users to learn about Postgres and Citus.

Claire: Thanks so much for talking to me about the Citus open source docs and your work documenting Hyperscale (Citus) on Azure, too. It’s safe to say that all of us on the Postgres team here at Microsoft really appreciate your love for Postgres, your attention to detail, and your commitment to help users learn about Citus.

Written by Claire Giordano

Head of open source community efforts for Postgres at Microsoft. Alum of Citus Data, Amazon, Sun Microsystems, and Brown University CS. Conference speaker at PGConfEU, PGConf.dev, FOSDEM, PGConf NYC, Nordic PGDay, pgDay Paris, PGDay Chicago, Swiss PGDay, & Citus Con. Talk Selection Team chair for POSETTE: An Event for Postgres. Loves sailing in Greece.