Making the PostgreSQL pgloader utility more Citus aware, a database hackathon project

This year, as part of the Citus Data annual team retreat and hackathon, we decided to get ambitious and try to create an easy, 1-click migration to the Citus distributed database. My team was lucky enough to get Dimitri Fontaine, the author of the excellent PostgreSQL pgloader utility, so we decided to start our hackathon project by building on top of pgloader and making pgloader more Citus-aware.

For those of you not familiar with Citus—Citus is an extension to PostgreSQL that transforms Postgres into a distributed database. We make Citus available as open source (star us on GitHub!), as enterprise software you can run anywhere, and as a fully-managed database as a service.

Our Citus hackathon project—making the PostgreSQL pgloader utility more Citus-aware

Before the 2018 Citus Data developer hackathon, here’s what the PostgreSQL pgloader utility could do:

  1. Migrate from MySQL, SQL Server, and SQLite to PostgreSQL
  2. Migrate from database files in DBF or IXF formats to PostgreSQL
  3. Migrate from CSV or fixed-file formats to PostgreSQL
  4. Unlike the PostgreSQL \copy command, it doesn’t stop and roll back when there’s a bad line.

The pgloader code itself is well-maintained and fortunately we had a good bit of Lisp knowledge on our team, so we were able to extend the Postgres pgloader utility to add native understanding of distributed and reference tables and data relationships.

Here’s what the current incarnation of this looks like if you’re using our example ad analytics app from our Citus database documentation:

load database
  from pgsql:///hackathon
  into pgsql://localhost:9700/dim

  with include drop, reset no sequences

  distribute companies using id
  distribute campaigns using company_id
  distribute ads using company_id from campaigns
  distribute clicks using company_id from ads, campaigns
  distribute impressions using company_id from ads, campaigns;

You’ll note several cool features here:

pgloader understands that it needs to distribute several tables, so the data gets passed to the workers instead of filling up the coordinator If you have a table that needs to have the distribution key added, it can happen automatically on the fly based off of data in a table you specify You don’t have to distribute all of your tables, so you can still do a staged migration to Citus if you prefer.

We were pretty proud of this to begin with, but the competition usually gets pretty fierce at these, so we felt that we had to step up our game a bit.

Recently, we’ve seen a lot of multi-tenant SaaS applications built on Rails that use our activerecord-multitenant gem, so one of the questions that came up in our brainstorming session was whether Rails and activerecord have any quirks that we could take advantage of, to automate more parts of the process.

We came up with a few interesting candidates, but the one that really stood out to us was foreign keys.

For those of you that haven’t looked into this in detail, activerecord will automatically add appropriate foreign keys when you use belongs_to, has_one, references, or many other commonly used associations. This means that, hidden in your database, there’s a whole lot of information we can use to put together a map of the data hierarchy, and, well, that was just too useful to ignore.

Thus was born the 2nd half of our Citus hackathon project: automatic distribution.

Now, all you need to do is pick one database table to distribute and the column to use, and pgloader walks the foreign key dependencies for you, figures out what other tables it can distribute with the same key, adds it if it needs to, and distributes everything. You’re on Citus in one command.

I’m just going to repeat that “in one command” bit here so it’s clear that’s not a typo :)

Here’s all you need for the updated pgloader syntax if you’ve got foreign keys in place:

load database
   from pgsql:///hackathon
   into pgsql://localhost:9700/dim

   with include drop, reset no sequences

   distribute companies using id;

So much fun creating a one-click database migration as part of the hackathon

All in all, we’re pretty excited with how our hackathon project turned out. It passed the tests we’ve done so far, so we’ve started cleaning up the quick and dirty code from the hackathon and Dimitri already added it to the repo. We’re looking forward to seeing how the Citus-ready pgloader works in a real-world environment, so if you end up trying it out, let us know what you think on our Citus public Slack!