COPY into distributed PostgreSQL tables, up to ~7M rows/sec

Written by Marco Slot
June 15, 2016

In the recent 5.1 release of the Citus extension for PostgreSQL, we added the ability to use the COPY command to load data into distributed tables. PostgreSQL's COPY command is one of the most powerful bulk loading features of any database and it is even more powerful on distributed tables.

To get a sense of the achievable ingestion rate for COPY on distributed tables, I set up a Citus cluster on EC2 with a c4.8xlarge master node and 4 i2.4xlarge workers. (If you want to get started more quickly, you can use Citus Cloud to immediately provision a fully managed Citus cluster on top of AWS.) I used the US names dataset, which contains ~5.7 million rows, and created a distributed table as follows:

CREATE TABLE names (
  state text,
  gender text,
  year int,
  name text,
  number int
);
SELECT master_create_distributed_table('names','name','hash');
SELECT master_create_worker_shards('names',16,1);

To get the best ingestion rate, I split the input into 16 chunks and loaded them in parallel using the following commands:

split -n l/16 names.csv chunks/
find chunks/ -type f | time xargs -n1 -P16 sh -c "psql -c \"\\COPY names FROM '\$0' WITH (FORMAT CSV)\""
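As a quick sanity check after the load completes, you can verify the row count from the master node (the expected total for this dataset is roughly 5.7 million; exact numbers depend on the dataset version you downloaded):

-- count is computed in parallel across all 16 shards
SELECT count(*) FROM names;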

I repeated the data loading command for several set-ups: local vs. distributed, with and without replication, and with and without an index on the year column. The results are in the table below.

| Table type  | Index | Replication | Ingestion rate |
|-------------|-------|-------------|----------------|
| Local       | No    | N/A         | 1.5M rows/sec  |
| Local       | Yes   | N/A         | 0.5M rows/sec  |
| Distributed | No    | No          | 7M rows/sec    |
| Distributed | No    | Yes         | 5M rows/sec    |
| Distributed | Yes   | No          | 2.5M rows/sec  |
| Distributed | Yes   | Yes         | 1.2M rows/sec  |

In the fastest set-up, Citus loaded up to 7 million rows/sec (finished in under a second). In the most realistic set-up, with replication and an index on the distributed table, COPY still achieves over 1M rows/sec for the names dataset. We recently saw a sustained 1.5M rows/sec in a production cluster, loading 10 billion rows in a few hours.
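For reference, the indexed and replicated variants differ only in the set-up commands. A sketch of how those variants could be created (the index name is illustrative, and here the index is added before the shards so that every shard inherits it):

-- "with index" variant: index on year, created before master_create_worker_shards
CREATE INDEX names_year_idx ON names (year);

-- "with replication" variant: the third argument is the replication factor,
-- so 2 keeps two copies of each of the 16 shards instead of 1
SELECT master_create_worker_shards('names',16,2);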

Ingestion rate depends on many factors, such as the number of columns, data types, hardware, and indexes, so benchmarks are unlikely to be representative of your use-case. Nonetheless, these numbers show that COPY, even with sharding, replication and indexes, can achieve a write throughput that rivals that of very large NoSQL clusters. At the same time, Citus can do both low latency lookups and parallel analytical queries, for example:

  # SELECT name,
           count(*)
    FROM names
    WHERE year >= 2000
    GROUP BY name
    ORDER BY 2 DESC LIMIT 5;

    name  | count
  --------+-------
   Riley  |  1612
   Jordan |  1573
   Avery  |  1557
   Taylor |  1517
   Peyton |  1497
  (5 rows)

Time: 118.176 ms (560 ms on regular Postgres, 2 seconds on BigQuery)
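For the low-latency lookup side, a query that filters on the distribution column (name, in this table) is routed to a single shard. A simple example, using a name from the output above:

SELECT state, year, number
FROM names
WHERE name = 'Riley' AND year = 2014;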

The combination of fast ingestion and fast queries makes Citus very suitable for applications such as real-time, interactive analytics.

In an upcoming post, we'll look at a way of scaling out COPY on Citus horizontally, allowing arbitrarily high ingestion rates.
