Compressing PostgreSQL JSONB data 6x using cstore_fdw

Written by Marco Slot
May 5, 2015

In this post I will present my findings that show a 6x compression of PostgreSQL JSONB data using the cstore_fdw extension. Cstore_fdw is an open source PostgreSQL extension for reducing the storage footprint and speeding up analytical queries. Cstore_fdw is maintained by Citus Data and is available on GitHub.

The JSONB data type is one of the major additions to PostgreSQL in the 9.4 release and also in CitusDB 4.0. JSONB gives the ability to store and query JSON documents in PostgreSQL without having to parse them at run-time. This addition instantly made PostgreSQL one of the most powerful document databases, introducing new possibilities for unstructured and big data use cases. However, there is a significant cost to using JSONB: Storage space. Even when using JSONB, the keys in the JSON are repeated in every row, which creates a significant storage overhead (e.g., 300%) compared to regular tables. So what if we compressed JSONB data using our cstore_fdw extension?

For this experiment, we set up 4 "customer_reviews" tables using different storage methods.

  • A regular table (12 columns)
  • A cstore_fdw table (12 columns)
  • A regular table with a single JSONB column
  • A cstore_fdw table with a single JSONB column

We loaded example CSV and JSON data into the tables using COPY and measured the table size using pg_relation_size and cstore_table_size. We found that using JSONB with cstore_fdw reduced the storage size by over 6 times compared to ordinary JSONB, and it's even much smaller than a regular table.

storage size of PostgreSQL tables bar chart

The reduced storage size does come at a cost. Data loading time goes up a bit because of the time spent compressing the data. Below are the load times on a c3.large EC2 instance.

Data load time into PostgreSQL tables bar chart

Since the data needs to be decompressed while running queries, query times go up as well. However, the good news is that they do not go up by nearly as much as the storage space goes down and CPU time can be reduced by scaling out using CitusDB. Below are the average query times over 10 runs on a c3.large EC2 instance for the following example query from the cstore_fdw github page or the JSONB equivalent:

  SELECT width_bucket(length(product_title), 1, 50, 5)
         title_length_bucket,
         round(avg(review_rating), 2) AS review_average, 
         count(*)
    FROM customer_reviews
   WHERE product_group = 'Book'
GROUP BY title_length_bucket
ORDER BY title_length_bucket;

Using JSONB with cstore_fdw significantly reduces the cost of storage, while fitting much more data into memory. The benefits of compression will really come to light when queries are I/O-bound. In the coming weeks, we will do a follow-up post in which we discuss more examples, tuning of compressed JSONB tables, and parallelizing queries using CitusDB.

Marco Slot

Written by Marco Slot

Former lead engineer for the Citus database engine at Microsoft. Speaker at Postgres Conf EU, PostgresOpen, pgDay Paris, Hello World, SIGMOD, & lots of meetups. Talk selection team member for Citus Con: An Event for Postgres. PhD in distributed systems. Loves mountain hiking.

@marcoslot marcocitus