cstore_fdw 1.1 release notes

We are excited to announce the release of cstore_fdw 1.1, Citus Data's open source columnar store extension for PostgreSQL. The changes in this release include:

  • Automatic file management. The filename foreign table option has become optional, and cstore_fdw now uses a default directory inside PostgreSQL's data directory to manage cstore tables (see the example after this list).
  • Automatically delete table files on DROP FOREIGN TABLE. In cstore_fdw v1.0 it was the user's responsibility to delete the files created by cstore_fdw after dropping a table. Failure to delete these files could lead to unexpected behavior later. For example, if a user dropped a table and then created another table with the same filename option, they could get errors when querying the new table. cstore_fdw now deletes table files automatically on DROP FOREIGN TABLE, eliminating this class of problem.
  • cstore_table_size. The new cstore_table_size('tablename') function can be used to get the size of a cstore table in bytes.
  • Improved documentation. “Using Skip Indexes” and “Uninstalling cstore_fdw” sections were added to the README file.
  • Bug fixes:
    • Previously, querying an empty table resulted in an error. Empty tables can now be queried as expected.
    • Previously, the cost estimation functions overestimated the number of columns. The source of the estimation error has been fixed.
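
The short sketch below illustrates the first three changes. The table definition and column names are illustrative rather than taken from the release:

    -- Set up cstore_fdw. With 1.1, the filename option may be omitted and the
    -- table file is managed automatically inside PostgreSQL's data directory.
    CREATE EXTENSION cstore_fdw;
    CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

    CREATE FOREIGN TABLE customer_reviews (
        customer_id    TEXT,
        review_date    DATE,
        review_rating  INTEGER
    )
    SERVER cstore_server
    OPTIONS (compression 'pglz');  -- note: no filename option

    -- New in 1.1: report the size of a cstore table in bytes.
    SELECT cstore_table_size('customer_reviews');

    -- New in 1.1: dropping the table also deletes its underlying files.
    DROP FOREIGN TABLE customer_reviews;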

For installation and update instructions, please see cstore_fdw's page on GitHub.

To learn more about what’s coming up for cstore_fdw see our development roadmap.

Got questions?

If you have questions about cstore_fdw, please contact us using the cstore-users Google group.

If you discover an issue when using cstore_fdw, please submit it to cstore_fdw’s issue tracker on GitHub.

PostgreSQL columnar store benchmarks on SSDs

In April we released cstore_fdw, the first columnar store foreign data wrapper for PostgreSQL. Our initial blog post generated a lot of interest in cstore_fdw, but also a lot of questions. In this blog post we attempt to answer some of the most common performance-related questions:

  • How much disk space does compression save me?
  • How much faster are queries using cstore_fdw?
  • How much do cstore_fdw skiplist indexes improve performance?

Before we start, we will briefly touch on the test environment. All of the tests below use data and representative queries generated by the industry-standard TPC-H benchmark. We ran our tests on EC2 hosts using 10 GB of generated TPC-H data loaded into CitusDB version 3.0 or PostgreSQL version 9.3.4. All database files were stored on SSD drives. CitusDB and PostgreSQL were configured identically, with the only changes from the stock configuration being shared_buffers set to 2 GB and work_mem set to 512 MB.

How much disk space does compression save me?

cstore_fdw allows users to enable compression on their tables to reduce the size of the data at rest on disk and to reduce the amount of I/O required to read the table. We won't dive into the details of the compression technique in this post; in short, each column of data is broken up into blocks of ten thousand values, and each block is compressed using PostgreSQL's built-in LZ compression. A table definition enabling this is sketched after the list below. Compression has four primary benefits:

  • You spend less money on expensive SSDs to store your data.
  • You read less from your disks, freeing up limited disk I/O for other tasks.
  • Smaller files are more likely to fit in the buffer cache, which significantly improves query times.
  • Disk access is often a performance bottleneck, and reading less from disk improves query times.
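
As a minimal sketch (only a few representative TPC-H lineitem columns are shown, and the cstore_server from the release notes above is assumed), compression is enabled per table through the compression option; block_row_count corresponds to the ten-thousand-value blocks described above and is also the default:

    -- Illustrative table definition; the column list is abbreviated.
    CREATE FOREIGN TABLE lineitem_cstore (
        l_orderkey       BIGINT,
        l_quantity       NUMERIC(15,2),
        l_extendedprice  NUMERIC(15,2),
        l_discount       NUMERIC(15,2),
        l_shipdate       DATE
    )
    SERVER cstore_server
    OPTIONS (compression 'pglz', block_row_count '10000');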

In order to test how well our compression technique works, we took 10 GB of raw data generated by TPC-H and loaded it both into a standard PostgreSQL database and into a PostgreSQL database that uses compressed cstore_fdw tables. We then compared the sizes of the tables in each database. The table below summarizes the relative sizes we measured for each table:

Table Type            Lineitem (9.1 GB)   Orders (2.1 GB)   Part (0.3 GB)   Partsupp (1.4 GB)   Customer (0.3 GB)
Regular               1.00                1.00              1.00            1.00                1.00
cstore                0.98                0.87              0.96            0.82                0.93
cstore (compressed)   0.26                0.27              0.22            0.26                0.40

Relative table sizes for generated TPC-H data, scale factor 10 (regular PostgreSQL storage = 1)

Depending on the TPC-H table, compression ratios varied between 2.5x and 4.5x. This type of variability is expected and depends on the size of the table and the type of data it contains. One other interesting note is that our 13.1 GB PostgreSQL database compressed down to only 3.5 GB when using our compressed storage format.

Using compression with cstore_fdw, we expect users to see compression ratios of 3-4x on their tables.
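
To check these numbers against your own data, you can compare PostgreSQL's built-in size function with the new cstore_table_size() function; the table names below are illustrative:

    -- Size of a regular PostgreSQL table, including indexes and TOAST data.
    SELECT pg_size_pretty(pg_total_relation_size('lineitem'));

    -- Size of the equivalent cstore table (cstore_table_size reports bytes).
    SELECT pg_size_pretty(cstore_table_size('lineitem_cstore'));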

How much faster is cstore_fdw relative to PostgreSQL?

cstore_fdw improves query times by reducing the amount of I/O in two ways:

  • Reading data only for the columns needed to answer a query, as opposed to reading every column of each row.
  • Compressing data on disk.

We wanted to evaluate how much of an improvement in query times users of cstore_fdw were likely to see in the real world. To do this we used the data and queries from the TPC-H benchmark as these are designed to represent real-world use cases. Our test consisted of loading 10 GB of generated TPC-H data into both a standard PostgreSQL database and a PostgreSQL database that used cstore_fdw storage. We then flushed the OS cache to make sure that all data was coming from SSDs. Next, we ran a selection of TPC-H queries against each storage engine.
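
As a point of reference, TPC-H query 6 (one of the queries in our selection) is reproduced below; it touches only a handful of lineitem columns, which is exactly the access pattern a columnar store accelerates:

    -- TPC-H query 6: revenue change from a hypothetical discount adjustment.
    SELECT sum(l_extendedprice * l_discount) AS revenue
    FROM lineitem
    WHERE l_shipdate >= date '1994-01-01'
      AND l_shipdate < date '1994-01-01' + interval '1 year'
      AND l_discount BETWEEN 0.06 - 0.01 AND 0.06 + 0.01
      AND l_quantity < 24;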

Table Type            Q3      Q5      Q6      Q7      Q8      Q10     Q12     Q14     Q19
PostgreSQL            36.78   34.37   25.92   36.17   37.42   36.55   33.86   31.91   28.60
cstore                26.60   28.34   15.75   29.03   32.15   27.88   26.28   15.50   33.00
cstore (compressed)   25.84   27.68   14.28   28.28   31.26   25.74   24.70   14.88   25.54

Query times in seconds for selected TPC-H queries, run on PostgreSQL 9.3.4

As shown above, the improvement from cstore_fdw varies by query. In the best case, query runtimes dropped to less than half their previous levels; in the worst case, the improvement is only 11% (with compression enabled).

Explaining this variation requires a brief look at how databases spend their time. Total query time can be thought of as the sum of two types of work: time spent doing I/O and time spent doing computation. Queries that use complex filters, aggregations, or other functions spend more of their time doing computation. For these queries it is difficult to significantly affect overall query time by reducing I/O, as the majority of the time is spent on non-I/O work. However, for many common queries there is relatively little computation, and reducing I/O has a significant impact on overall query times.

Across the set of TPC-H queries we see that, with compression enabled, every query executes faster on cstore_fdw, and users can expect their analytic queries to run 20-30% faster. We'll show how to turn this 20-30% improvement into a 60-70% one later, in our investigation of skiplist indexes.

How much faster is cstore_fdw relative to CitusDB?

CitusDB enables its users to create a horizontally scalable PostgreSQL database that can leverage the PostgreSQL extension ecosystem. To determine the benefits of cstore_fdw for CitusDB users, we set up an experiment identical to the one we ran for PostgreSQL, except this time we compared two CitusDB clusters: one using PostgreSQL's native storage format for the TPC-H data and one using cstore_fdw to store the same data.

Table Type                      Q3      Q5      Q6      Q7      Q8      Q10     Q12     Q14     Q19
CitusDB                         35.87   35.62   25.68   35.92   39.54   40.36   31.86   29.51   29.59
CitusDB + cstore                13.66   13.34   6.18    14.52   18.64   17.48   9.62    9.40    17.57
CitusDB + cstore (compressed)   12.48   12.06   5.60    12.88   17.03   16.01   7.60    8.27    12.08

Query times in seconds for selected TPC-H queries, run on CitusDB 3.0

The results are similar to those for cstore_fdw on PostgreSQL, but the magnitude of the improvement differs significantly. On PostgreSQL we saw an average improvement in query times of 28%, whereas on CitusDB we see an average improvement of 66%. This occurs because CitusDB lets you use all of a machine's cores simultaneously to answer a query. In doing so, CitusDB introduces more demand, and hence more contention, for the limited disk I/O available. As a result, the reduction in I/O that cstore_fdw brings contributes more significantly to improving query times. Users running cstore_fdw on PostgreSQL in an environment where multiple clients query in parallel should expect similar results.

cstore_fdw users with CitusDB can expect to see their performance improve by 60-70% for their analytic query workloads.

How much do cstore_fdw skiplist indexes improve performance?

Indexing is a common technique for improving query performance, and PostgreSQL offers a number of indexing methods to its users. However, when using a foreign data wrapper like cstore_fdw, users can no longer leverage PostgreSQL's built-in indexing. To give cstore_fdw users the benefits of indexes, we adapted the indexing technique from ORC, which breaks each column's data into blocks and stores the minimum and maximum values of each block; blocks whose min/max range cannot match a query's filter are skipped entirely.

To determine how much of an improvement this technique offers, we loaded the TPC-H data into two PostgreSQL databases using cstore_fdw. The most effective use of ORC's indexing requires that the column data be sorted, so we loaded sorted data into one database and unsorted data into the other, and then ran the TPC-H queries against each. All data was sorted by its natural ordering; for example, orders were sorted by order date.
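
A sketch of the sorted-load approach follows. The file path and table names are illustrative, and the input file is assumed to be pre-sorted by order date (with dbgen's trailing delimiter stripped). cstore_fdw tables are populated with COPY and preserve input order, so pre-sorting the input by its natural key keeps each block's min/max range narrow:

    -- Load data that was pre-sorted by o_orderdate (file path is illustrative).
    COPY orders_cstore FROM '/data/orders.sorted.tbl'
        WITH (FORMAT csv, DELIMITER '|');

    -- A filter on the sort column can now skip most blocks entirely, using
    -- the min/max values stored for each block.
    SELECT count(*)
    FROM orders_cstore
    WHERE o_orderdate >= date '1994-01-01'
      AND o_orderdate <  date '1995-01-01';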

One thing to note about these results is that the tests were done without compression enabled. Users should expect to see additional improvements from enabling compression as described in previous sections.

Table Type        Q3      Q5      Q6      Q7      Q8      Q10     Q12     Q14     Q19
cstore            34.96   36.96   15.96   37.27   40.44   33.26   29.42   17.44   36.13
cstore (sorted)   26.47   34.78   4.62    26.37   38.38   21.98   13.40   3.65    37.28

Query times in seconds for selected TPC-H queries, run on PostgreSQL 9.3.4

When using cstore_fdw with sorted input, users should expect a 30-35% improvement in query times compared to unsorted input, and a 60-70% improvement compared to PostgreSQL's native storage.

Conclusions

After benchmarking cstore_fdw in a number of analytic query scenarios, we learned that on SSDs:

  • cstore_fdw users can expect to see their databases shrink by 3-4x
  • cstore_fdw users with PostgreSQL can expect to see their query times improve by 20-30%
  • cstore_fdw users with CitusDB can expect to see their query times improve by 60-70%
  • cstore_fdw users can use indexing for a further ~30% improvement to query times

We hope this post has left you interested in seeing what cstore_fdw and CitusDB can do for your data. Feel free to contact us with questions or comments using our Contact Us form or by simply emailing us at engage @ citusdata.com.

To download cstore_fdw and CitusDB, see these links: cstore_fdw and CitusDB.
