Making PostgreSQL scale Hadoop-style: Benchmark numbers

Written by Marco Slot
October 31, 2014

PostgreSQL is one of the most popular open-source databases thanks to its reliability and advanced feature set. At Citus Data, we scale out PostgreSQL across many machines to enable queries over large datasets. We are often asked how CitusDB performance compares to Hadoop-based SQL engines that also scale out on commodity hardware. So we benchmarked the latest version of CitusDB and two of the leading Hadoop-based SQL solutions, Impala and SparkSQL, using the TPC-H benchmark.

We primarily wanted to know where we stand and what we can do to make PostgreSQL scale even better, so our aim was to make a reasonably fair comparison: we spent several weeks becoming familiar with each system and tuning it appropriately.

Benchmark set-up

We compared the latest version of CitusDB against Impala, because it was the best-performing SQL-on-Hadoop solution in a recent paper by IBM, which performed a similar benchmark on dedicated hardware using Impala 1.2.2. The latest version of Impala that we used in the benchmarks, Impala 2.0.0, has even better performance. Spark has demonstrated very good performance and scalability, and its growing popularity led us to also include SparkSQL 1.1.0 in our benchmark.

We ran the industry standard TPC-H benchmark, which simulates a decision support system for a retailer. Why TPC-H? Because it is hard! The queries stress the database in different ways and there are no easy workarounds. While TPC-H has its limitations like any other benchmark, it provides a standardized baseline to compare to other systems and improve upon. We used the queries from the TPC-H benchmark that we could run and validate on each system.

We set up CitusDB on Amazon EC2 using a CloudFormation template, which generated and loaded the TPC-H data into CitusDB. Spark and Impala also provide tools for setting up a cluster on EC2. We loaded the TPC-H dataset into HDFS for Spark and Impala using hadoop distcp from an S3 bucket in text format, and then loaded the data into tables stored as Parquet files to get the best possible performance. We ran the TPC-H queries in mostly unmodified form, apart from a few minor changes needed to work around certain limitations, such as using strings instead of a native Date type in Impala and SparkSQL. We tuned each system to get the best query performance. In particular, we needed to adjust memory settings (e.g., shared_buffers=1GB and work_mem=512MB for PostgreSQL, MEM_LIMIT=-1 for Impala, storage.memoryFraction=0.65 for SparkSQL) and tune a few other settings, such as increasing timeouts in SparkSQL.
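To make the memory tuning concrete, here is a small sketch of the PostgreSQL side of those settings. The values are the ones mentioned above rather than general recommendations, and shared_buffers has to be set in postgresql.conf and picked up on a restart rather than per session.

    -- Per-session sketch of the PostgreSQL memory settings mentioned above;
    -- shared_buffers itself is set in postgresql.conf and needs a restart.
    SET work_mem = '512MB';      -- memory per sort/hash operation
    SHOW shared_buffers;         -- should report 1GB after the restart
    -- For comparison: Impala used MEM_LIMIT=-1 (set in impala-shell), and
    -- SparkSQL used spark.storage.memoryFraction=0.65 in spark-defaults.conf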

In-memory query performance

Impala and SparkSQL are optimized for in-memory workloads, so we wanted to start with the scenario where the entire database fits in memory. We set up 21 r3.4xlarge EC2 instances, one of which served as the master from which the queries were run. These instances have the following specs:

  • 8 physical cores per node (160 total)
  • 122GB memory per node (2.4TB total)
  • 1x320GB SSD (6.4TB total)

This set-up should be sufficient to process 1TB of data in memory, but some join queries may still use disk to store large intermediate results.

Prior to running any queries we ran all the necessary commands for optimizing query performance (analyze for PostgreSQL, compute stats for Impala, cache for Spark). We then ran the TPC-H queries that are supported on each system 5 times in sequence, and omitted the results of the first round. The first run can be somewhat variable in the in-memory case, because part of the data still has to be loaded into memory. Spark doesn't cache any data in memory before the first run, while Impala and CitusDB rely on the Linux page cache.
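For reference, these are the statements in question, using the lineitem table as an example (the Impala and SparkSQL statements are run from their respective shells, not from PostgreSQL):

    -- CitusDB / PostgreSQL: refresh planner statistics
    ANALYZE lineitem;
    -- Impala (impala-shell): gather table and column statistics
    COMPUTE STATS lineitem;
    -- SparkSQL (Spark SQL CLI): pull the table into the in-memory cache
    CACHE TABLE lineitem;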

The best-case (thick bar) and worst-case (thin bar) query times are shown below.

[Chart: in-memory TPC-H query times for CitusDB, Impala, and SparkSQL, best and worst case per query]

The results show that CitusDB is the fastest in 8 of the 11 queries that could run.

The gap between the query times is biggest for Q3 and Q19. As we mentioned, each TPC-H query stresses the database in different ways. In Q3, one of the challenges is to deal with a very large, sorted result set of which only the top 10 rows need to be returned; in Q19, the main challenge is to handle the OR clauses in its filter. CitusDB handles these cases efficiently in its query plan optimizer.
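To give an idea of what Q3 asks of the database, here is roughly what it looks like with the commonly used default parameter values (the exact text comes from the TPC-H query generator): three tables are joined, a large grouped result is sorted by revenue, and only the top 10 rows are returned, so a planner that pushes the LIMIT into the sort avoids materializing the full sorted result.

    -- TPC-H Q3 (approximate form, default substitution parameters)
    SELECT l_orderkey,
           sum(l_extendedprice * (1 - l_discount)) AS revenue,
           o_orderdate,
           o_shippriority
    FROM customer, orders, lineitem
    WHERE c_mktsegment = 'BUILDING'
      AND c_custkey = o_custkey
      AND l_orderkey = o_orderkey
      AND o_orderdate < date '1995-03-15'
      AND l_shipdate > date '1995-03-15'
    GROUP BY l_orderkey, o_orderdate, o_shippriority
    ORDER BY revenue DESC, o_orderdate
    LIMIT 10;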

SparkSQL is slower than CitusDB in all cases and fails on some of the bigger join queries. The behaviour we observed is that it evicts some of the intermediate results that it needs later on from its cache, then tries to re-compute them, but loses other results in doing so; this repeats until the query fails. It should be noted that SparkSQL, the successor to Shark, is still in an early stage of development.

Query performance on SSDs

It is often not possible to fit all the data in memory, even with more memory available at lower costs. When the dataset does not fit in memory, it will be read from a hard drive or SSD while the query is running. Intermediate data used to execute the query may also not fit in memory. Columnar storage is well-suited for this use-case, since it can dramatically reduce the amount of data that has to be read from disk by only reading relevant columns. For CitusDB, we use our own cstore_fdw extension for PostgreSQL to store tables on disk.
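As a rough sketch of what that looks like in practice (following the cstore_fdw README of the time; option names may have changed in later versions), a columnar table is simply a foreign table backed by the cstore_fdw server:

    -- Minimal cstore_fdw sketch; the column list is abbreviated for brevity.
    CREATE EXTENSION cstore_fdw;
    CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;
    CREATE FOREIGN TABLE lineitem (
        l_orderkey      bigint,
        l_extendedprice numeric(15,2),
        l_discount      numeric(15,2),
        l_shipdate      date
        -- ... remaining TPC-H lineitem columns ...
    ) SERVER cstore_server
      OPTIONS (compression 'pglz');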

For this benchmark we used 1 r3.4xlarge as a master, and 4 i2.4xlarge EC2 instances, which have similar specs to the r3.4xlarge, but have more and bigger drives:

  • 8 physical cores per node (40 total)
  • 122GB memory per node (610GB total)
  • 4x800GB SSD (12.8TB total)

We again ran each query 4 times and show the best-case (thick bar) and worst-case (thin bar) query times below.

[Chart: TPC-H query times on SSDs for CitusDB, Impala, and SparkSQL, best and worst case per query]

When data does not fit in memory, CitusDB with cstore_fdw is again fastest in 8 out of 11 queries. The remaining queries (Q1, Q6, Q12) perform several aggregate functions like sum and avg over the biggest table (lineitem). The reason that those are relatively slower is that PostgreSQL reads rows one by one, incrementally updating the function result. Vectorized execution pulls in the data for many rows at once and computes the aggregate for them in one go. We've recently prototyped vectorized execution for PostgreSQL, which makes aggregate functions 3-4x faster, but did not use those changes in these benchmarks.
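Q1 is a good example of this pattern; an abbreviated form is shown below (the full query computes a few more aggregates). Nearly all of the work is scanning lineitem and updating per-group sums and averages row by row, which is exactly where an executor that processes a batch of rows per call pays off.

    -- TPC-H Q1, abbreviated: a scan-heavy aggregation over lineitem
    SELECT l_returnflag,
           l_linestatus,
           sum(l_quantity)      AS sum_qty,
           sum(l_extendedprice) AS sum_base_price,
           avg(l_extendedprice) AS avg_price,
           count(*)             AS count_order
    FROM lineitem
    WHERE l_shipdate <= date '1998-12-01' - interval '90' day
    GROUP BY l_returnflag, l_linestatus
    ORDER BY l_returnflag, l_linestatus;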

PostgreSQL is generally more efficient at reading data from disk because it can use the local file-system directly and is optimized for doing so. Impala and SparkSQL read data from HDFS, which handles sharding and replication for them, but comes with a performance penalty. CitusDB implements its own replication and sharding scheme, in which each shard or replica is stored as a separate table on a worker node.
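Concretely, a shard placement on a worker is just another PostgreSQL table (plain or cstore_fdw-backed) whose name combines the distributed table's name with a shard ID, so it can be inspected with ordinary SQL; the shard ID below is made up for illustration.

    -- Hypothetical shard placement on a worker node: the shard ID (102008) is
    -- illustrative, but the <table>_<shardid> naming follows the scheme above.
    SELECT count(*) FROM lineitem_102008;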

Unfortunately, SparkSQL failed to run several of the join queries. We've also seen cases where a SparkSQL cluster would stop working after a few queries, possibly due to the way it caches intermediate results.

Conclusions

CitusDB and cstore_fdw make PostgreSQL scale to many machines and large datasets on commodity hardware. On a standard benchmark, its performance is already comparable to the fastest Hadoop-based solutions, and often much better across a variety of on-disk and in-memory workloads. Moreover, PostgreSQL is a mature database, which gives you a lot of other possibilities to improve performance (e.g., user-defined functions, indexes), lets you use your own extensions and custom queryable data types (e.g., JSON), and is generally very reliable, making it a great complement to any horizontally scalable infrastructure.

Stay tuned, as we will also be releasing an exciting addition to scalable PostgreSQL in the upcoming months!
