Parallel data loading for pg_shard

The pg_shard extension helps you scale out PostgreSQL for large data sets by transparently distributing tables across many nodes, storing (replicated) shards of the data as regular PostgreSQL tables on worker nodes. With pg_shard, you can run INSERT, UPDATE, DELETE, and SELECT commands on a distributed table as if it were a regular table.
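For reference, the distributed table can be set up as sketched below. This is a minimal sketch loosely based on the customer_reviews example on the pg_shard GitHub page: the column list is abbreviated, and the shard count and replication factor are illustrative values, not recommendations.

psql <<'SQL'
-- abbreviated customer_reviews schema (illustrative columns)
CREATE TABLE customer_reviews
(
    customer_id TEXT NOT NULL,
    review_date DATE,
    review_rating INTEGER
);

-- tell pg_shard to distribute the table by customer_id
SELECT master_create_distributed_table('customer_reviews', 'customer_id');

-- create 16 shards, each replicated to 2 worker nodes
SELECT master_create_worker_shards('customer_reviews', 16, 2);
SQL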

While first-class COPY support is on the horizon, pg_shard already comes with a copy_to_distributed_table tool for loading data files, which supports the same data formats as COPY. When using this tool, pg_shard creates a temporary 'insert proxy' table with a trigger that turns each incoming row into an INSERT on the distributed table. The copy_to_distributed_table script then copies the file into the insert proxy.
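Conceptually, the proxy mechanism resembles the sketch below: a staging table with a row-level trigger that forwards every copied row as an INSERT on the distributed table. This is a simplified illustration rather than the script's actual internals, and the proxy_insert function name is made up for the example.

psql <<'SQL'
-- staging table with the same columns as the distributed table
CREATE TEMPORARY TABLE insert_proxy (LIKE customer_reviews);

CREATE FUNCTION proxy_insert() RETURNS trigger AS $$
BEGIN
    -- forward each copied row as an INSERT on the distributed table
    INSERT INTO customer_reviews VALUES (NEW.*);
    RETURN NULL;  -- keep the proxy table itself empty
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER redirect_insert
    BEFORE INSERT ON insert_proxy
    FOR EACH ROW EXECUTE PROCEDURE proxy_insert();

-- a COPY into the proxy now drives INSERTs on customer_reviews
\copy insert_proxy FROM 'customer_reviews_1998.csv' WITH CSV
SQL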

Below we give an example of loading the customer reviews data into the customer_reviews table from the pg_shard GitHub page using the copy_to_distributed_table script.

copy_to_distributed_table -C customer_reviews_1998.csv customer_reviews

Some users have noticed that pg_shard performs poorly when issuing many INSERT commands over a single connection, and the same applies to copy_to_distributed_table. The reason is simple: pg_shard has to wait for a network round-trip on every INSERT. Since there is no notion of parallelism within a single PostgreSQL connection, consecutive INSERTs on the same connection each have to wait for the previous one to complete. However, pg_shard can still handle INSERTs on other connections while waiting.

When loading a large data file, it's therefore recommended to split the file into multiple chunks first and use xargs -P 64 to load the chunks concurrently. The commands below have the same effect as the command above, but run much faster.

mkdir chunks
split -n l/64 customer_reviews_1998.csv chunks/
find chunks/ -type f | xargs -n 1 -P 64 sh -c 'echo "$0" `copy_to_distributed_table -C "$0" customer_reviews`'
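Once the load finishes, a quick sanity check is to run a SELECT that filters on the partition column, which pg_shard routes to a single shard. The customer ID below is a placeholder; substitute any value present in the data set.

psql -c "SELECT review_date, review_rating FROM customer_reviews WHERE customer_id = 'some-customer-id'"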

On a c3.8xlarge EC2 instance, parallel data loading is roughly 40x faster and inserts around 50,000 rows per second in the above example, and there are many ways of scaling out pg_shard even further. You can also use pg_shard in combination with CitusDB, which has powerful built-in bulk loading and query parallelization features for distributed tables.

Webinar on PostgreSQL Real-Time Analytics & Terabyte-Scale Data Ingestion on May 20th

Learn how to enable real-time analytics on terabytes of data using PostgreSQL by combining analytics and operational workloads in a single database. Join Utku Azman from Citus Data at 10:00 AM PDT on Wednesday, May 20, 2015 for a webinar that will discuss the challenges of real-time analytics on big data in PostgreSQL and present a case study of a company that has solved this challenge. You can register now by visiting the Citus Data Webinars page.

In this live webinar, you will hear how users can differentiate themselves in a competitive environment by leveraging the rich and flexible PostgreSQL ecosystem. You will become familiar with the concepts and methods employed to enable processing of billions of events in real time using PostgreSQL-based solutions. You will also hear about real-life implementations, including how CloudFlare uses CitusDB to power real-time analytics for millions of websites on tens of terabytes of data.

By the end of the webinar, attendees will have a greater understanding of the CitusDB features that enable massive parallelization of analytical queries, real-time ingestion on distributed data sets, columnar storage and advanced data compression, and high availability and dynamic scaling. They will also have a better understanding of a real-time analytics use case on PostgreSQL at scale.

Visit the Citus Data Webinars page now to register.
