Announcing Citus 5.1, Improving your data ingest speed and experience

Written by Sumedh Pathak
May 20, 2016

At Citus we want to make dealing with large amounts of operational and analytical workloads easier. Data ingestion speed is key, being the necessary first step in working with any new database. Moreover ingestion is something you'll do repeatedly in testing and development so the bulk-loading user experience is important as well. With the release of Citus 5.1 the experience in loading data is much better all around, and we’ve managed to sneak in a few other improvements as well. Read more below or give it a try today.

A better data loading experience

The most frequent usability issues we encountered were the use of custom scripts and clients for data ingest. To resolve this, we decided to directly hook into PostgreSQL's COPY command, and thus provide the same access and familiar features it provides. You can now use both the server side COPY and the client side \COPY for all types of distribution methods in Citus. An example of the neat features this unlocks, is using the COPY from PROGRAM option, which can allow you to unzip a file and directly pipe it to COPY instead of writing it back to disk.

In addition to the above, performance for loading data is significantly improved in a number of ways:

  • COPY for hash-distributed tables is now orders of magnitude faster than the older copytodistributed_table script. While the previous script topped out at 60-80K inserts/second, we've seen ingest up to a million inserts/second with COPY.
  • COPY for append-distributed tables now 2x faster than previous custom STAGE command.
  • Fast shard-pruning path for improved INSERT planning performance. We can support thousands of shards with minimal planning overhead during an INSERT.

Better insights with EXPLAIN

Another key new feature in 5.1 is the ability to perform an EXPLAIN on distributed queries. You can get more insight into how your queries on distributed tables are planned and distributed by Citus. Along with our ability to propagate index creation statements, you can easily measure and tune distributed query performance.

You can see the changelog for a more comprehensive view of the changes or take a look at upgrading today. If you have any questions or feedback we’d love to hear from you in our Slack channel or on our mailing list.

Sumedh Pathak

Written by Sumedh Pathak

Former principal engineer on the Postgres team at Microsoft. Co-founder & VP of Engineering at Citus Data. Speaker at QCon London & DataEngConf SF. M.S. Computer Science Stanford. Family. Tennis ball. Dog.