If you want to learn more about Citus on Microsoft Azure, read this post about Hyperscale (Citus) on Azure Database for PostgreSQL.
In this post I will present my findings that show a 6x compression of PostgreSQL JSONB data using the cstore_fdw extension. Cstore_fdw is an open source PostgreSQL extension for reducing the storage footprint and speeding up analytical queries. Cstore_fdw is maintained by Citus Data and is available on GitHub.
The JSONB data type is one of the major additions to PostgreSQL in the 9.4 release and also in CitusDB 4.0. JSONB gives the ability to store and query JSON documents in PostgreSQL without having to parse them at run-time. This addition instantly made PostgreSQL one of the most powerful document databases, introducing new possibilities for unstructured and big data use cases. However, there is a significant cost to using JSONB: Storage space. Even when using JSONB, the keys in the JSON are repeated in every row, which creates a significant storage overhead (e.g., 300%) compared to regular tables. So what if we compressed JSONB data using our cstore_fdw extension?
For this experiment, we set up 4 “customer_reviews” tables using different storage methods.
- A regular table (12 columns)
- A cstore_fdw table (12 columns)
- A regular table with a single JSONB column
- A cstore_fdw table with a single JSONB column
We loaded example CSV and JSON data into the tables using COPY and measured the table size using pg_relation_size and cstore_table_size. We found that using JSONB with cstore_fdw reduced the storage size by over 6 times compared to ordinary JSONB, and it’s even much smaller than a regular table.
The reduced storage size does come at a cost. Data loading time goes up a bit because of the time spent compressing the data. Below are the load times on a c3.large EC2 instance.
Since the data needs to be decompressed while running queries, query times go up as well. However, the good news is that they do not go up by nearly as much as the storage space goes down and CPU time can be reduced by scaling out using CitusDB. Below are the average query times over 10 runs on a c3.large EC2 instance for the following example query from the cstore_fdw github page or the JSONB equivalent:
SELECT width_bucket(length(product_title), 1, 50, 5) title_length_bucket, round(avg(review_rating), 2) AS review_average, count(*) FROM customer_reviews WHERE product_group = 'Book' GROUP BY title_length_bucket ORDER BY title_length_bucket;
Using JSONB with cstore_fdw significantly reduces the cost of storage, while fitting much more data into memory. The benefits of compression will really come to light when queries are I/O-bound. In the coming weeks, we will do a follow-up post in which we discuss more examples, tuning of compressed JSONB tables, and parallelizing queries using CitusDB.