Faster bulk loading in Postgres with copy

If you've used a relational database, you understand basic INSERT statements. Even if you come from a NoSQL background, you likely grok inserts. Within the Postgres world, there is a utility that is useful for fast bulk ingestion: \copy. Postgres \copy is a mechanism for you to bulk load data in or out of Postgres.

First, lets pause. Do you even need to bulk load data and what's it have to do with Citus? We see customers leverage Citus for a number of different uses. When looking at Citus for a transactional workload, say as the system of record for some kind of multi-tenant SaaS app, your app is mostly performing standard insert/updates/deletes.

But when you're leveraging Citus for real-time analytics, you may already have a separate ingest pipeline. In this case you might be looking at event data, which can be higher in volume than a standard web app workload. If you already have an ingest pipeline that reads off Apache Kafka or Kinesis, you could be a great candidate for bulk ingest.

Back to our feature presentation: Postgres \copy. Copy is interesting because you can achieve much higher throughput than with single row inserts.

Postgres \copy:

Can be executed from within psql or an application language
Supports CSV as well as binary data
Is transactional

A look at Postgres \copy performance for bulk ingest

Let's take a look at both how Postgres copy performs over single row inserts, then we'll move onto how you can leverage \copy. There are essentially 3 primary ways to scale out your writes in single-node Postgres. And if you need to scale out ingest even further beyond a single node you can look to Citus as each node in a Citus database cluster is able to help you at scaling writes.

Starting simple with inserts, using sample GitHub event data

First let's see how we perform on a standard insert.

To perform this test, I created a set of 1,000,000 inserts from the GitHub event dataset. I then connected to Postgres with psql and ran \i single_row_inserts.sql. This command executed all the insert queries.

The result: it took 15 minutes 30 seconds to load up 1 million events records. Or right at 1,075 inserts per second on a small-size Postgres instance. Yes, I could have had a few more writers going at once and further tuned my test, but this gives us a starting baseline to compare against.

Performance tuning our inserts

It is of note here that each insert is a transaction. What this means is Postgres is doing some extra coordination to make sure the transaction is completed before returning. On every single write this takes some overhead. Instead of single row transactions, if we wrap all of our inserts in a transaction like below, we'll see some nice performance gains:

begin;
insert 1;
insert 2;
insert 3;
...
commit;

This took my inserts down from 15 minutes 30 seconds to 5 minutes and 4 seconds. We've suddenly boosted our throughput by 3x to about 3k inserts per second.

Bulk ingest with even greater performance ft. Postgres \copy

By batching our inserts into a single transaction, we saw our throughput go higher. But hold on, there is even more we can do. The \copy mechanism gives a way to bulk load data in an even more performant manner. Here instead of raw insert statements, we need the data in some CSV format. Copy accepts a number of delimiters and even forms of binary data as well, but a standard CSV should work. Once we have our CSV of events we can load it up from within Postgres:

\copy github_events from 1millionevents.csv WITH csv;

Running this \copy completes in 82 seconds! We're now processing over 10k writes per second on some fairly modest hardware.

But psql?

Okay, so those of you who are frequent repeaters of our Citus Data blog know that I love psql and the command line utilities in Postgres, but the CLI may not be helpful directly in your application. Fortunately, the copy utilities are available in most languages. Here's the quick rundown for a few popular languages:

Leveraging bulk loading in Ruby

Within Ruby you'll need to be using the standard pq driver or ActiveRecord. The following gives you a rough sketch for establishing a connection from within ActiveRecord and loading from CSV on disk:

  conn = ActiveRecord::Base.connection
  rc = conn.raw_connection
  rc.exec("COPY my_table (col1, col2, etc) FROM STDIN WITH CSV")

  file = File.open('myfile.csv', 'r')
  while !file.eof?
    rc.put_copy_data(file.readline)
  end

  rc.put_copy_end

Leveraging bulk loading in Python

Of course anything Ruby can do, Python (can do better ;) ). In Python, leverage a similar process as we did in ruby. With this code example we’re going to generate a CSV live though as opposed to reading one from disk. In production in both Ruby and Python you could parse the CSV out from disk, or generate it live from some other part of your pipeline like Kafka/Kinesis.

conn = psycopg2.connect(DSN)
curs = conn.cursor()

data = StringIO.StringIO()
data.write('\n'.join(['1', '2','3']))
data.seek(0)

curs.copy_from(data, 'my_table')

Go faster with Postgres \copy, and even faster with Citus

Bulk ingestion with \copy is great for a lot of workloads anddoesn't require you to load up millions of record CSVs either. You can start with micro-batches of 100, 1000, or 10000 records and still see significant performance gains.

Even better, if you're running into limits on a single-node Postgres instance, \copy is supported out of the box with Citus and gets parallelized so you can see even higher throughputs. Need help ingesting large amounts of data? Contact my team here at Citus and we'd be happy to chat.

Written by Craig Kerstiens

Former Head of Cloud at Citus Data. Ran product at Heroku Postgres. Countless conference talks on Postgres & Citus. Loves bbq and football.