Parallel data loading for pg_shard

Written by Marco Slot
May 22, 2015

The pg_shard extension helps you scale out PostgreSQL for large data sets by transparently distributing tables across many nodes, storing (replicated) shards of the data as regular PostgreSQL tables on worker nodes. With pg_shard, you can perform INSERT, UPDATE, DELETE, SELECT on a distributed table as if you were using a regular table.

While first-class COPY support is on the horizon, pg_shard already comes with a copy_to_distributed_table tool for loading data files, which supports the same data formats as COPY. When using this tool, pg_shard creates a temporary 'insert proxy' table which has a trigger that INSERTs into the distributed table. The copy_to_distributed_table script copies a file into the insert proxy.

Below we give an example of loading customer reviews data into the example table on the pg_shard github page using the copy_to_distributed_table script.

copy_to_distributed_table -C customer_reviews_1998.csv customer_reviews

Some users have noticed poor performance when performing many INSERTs over a single connection and the same applies to copy_to_distributed_table. The reason is simple: pg_shard has to wait for a network round-trip on every INSERT. Since there is no notion of parallelism within a single PostgreSQL connection, consecutive INSERTs on the same connection will have to wait. However, pg_shard can still handle more INSERTs on other connections while waiting.

When loading a large data file, it's therefore recommended to split the file into multiple chunks first and use xargs -P 64 to load the chunks concurrently. The commands below have the same effect as the command above, but run much faster.

mkdir chunks
split -n l/64 customer_reviews_1998.csv chunks/
find chunks/ -type f | xargs -n 1 -P 64 sh -c 'echo $0 `copy_to_distributed_table -C $0 customer_reviews`'

On a c3.8xlarge EC2 instance, parallel data loading is roughly 40x faster and inserts around 50,000 rows per second in the above example, but there are many ways of scaling out pg_shard even further. Pg_shard can also be used in combination with CitusDB for bulk data loading and parallel querying.

Marco Slot

Written by Marco Slot

Former lead engineer for the Citus database engine at Microsoft. Speaker at Postgres Conf EU, PostgresOpen, pgDay Paris, Hello World, SIGMOD, & lots of meetups. Talk selection team member for Citus Con: An Event for Postgres. PhD in distributed systems. Loves mountain hiking.

@marcoslot marcocitus