PostgreSQL is one of the most popular open-source databases thanks to its reliability and advanced feature set. At Citus Data, we scale out PostgreSQL across many machines to enable queries over large datasets. We are often asked how CitusDB performance compares to Hadoop-based SQL engines that also scale out on commodity hardware. So we benchmarked the latest version of CitusDB and two of the leading Hadoop-based SQL solutions, Impala and SparkSQL, using the TPC-H benchmark.
We primarily wanted to know where we stand and what we can do to make PostgreSQL scale even better, so our aim has been to do a reasonably fair comparison and we spent several weeks becoming familiar with each system and tuning it appropriately.
We compared the latest version of CitusDB against Impala, because it was the best performing SQL-on-Hadoop solution in a recent paper by IBM, which performed a similar benchmark on dedicated hardware using Impala 1.2.2. The latest version of Impala that we used in the benchmarks, Impala 2.0.0, has even better performance. Spark has demonstrated very good performance and scale and the growth in its popularity led us to also include SparkSQL 1.1.0 in our benchmark.
We ran the industry standard TPC-H benchmark, which simulates a decision support system for a retailer. Why TPC-H? Because it is hard! The queries stress the database in different ways and there are no easy workarounds. While TPC-H has its limitations like any other benchmark, it provides a standardized baseline to compare to other systems and improve upon. We used the queries from the TPC-H benchmark that we could run and validate on each system.
We set up CitusDB on Amazon EC2 using a CloudFormation template, which generated and loaded the TPC-H data into CitusDB. Spark and Impala also provide tools for setting up a cluster on EC2. We loaded the TPC-H dataset into HDFS for Spark and Impala using hadoop distcp from an S3 bucket in text format, and then loaded the data into tables stored as parquet files to get the best possible performance. We ran the TPC-H queries in mostly unmodified form, apart from a few minor changes needed to work around certain limitations, such as using strings instead of a native Date type in Impala and SparkSQL. We tuned each system to get the best query performance. In particular, there was a need to improve memory settings (e.g., shared_buffers=1GB and work_mem=512MB for PostgreSQL, MEM_LIMIT=-1 for Impala, storage.memoryFraction=0.65 for SparkSQL), and tune a few other settings such as increasing timeouts in SparkSQL.
Impala and SparkSQL are optimized for in-memory workloads, so we wanted to start with the scenario where the entire database fits in memory. We set up 21 r3.4xlarge EC2 instances, one of which is used as the master from which the query is run. These instances have the following specs:
This set-up should be sufficient to process 1TB of data in memory, but some join queries may still use disk to store large intermediate results.
Prior to running any queries we ran all the necessary commands for optimizing query performance (analyze for PostgreSQL, compute stats for Impala, cache for Spark). We then ran the TPC-H queries that are supported on each system 5 times in order, and omit the result of the first round. The first run can be somewhat variable in the in-memory case, because part of the the data still has to be loaded into memory. Spark doesn't cache any data in memory before the first run, while Impala and CitusDB rely on the Linux page cache.
The best-case (thick bar) and worst-case (thin bar) query times are shown below.
The results show that CitusDB is the fastest in 8 of the 11 queries that could run.
The gap between the query times is the biggest in case of Q3 and Q19. As we mentioned, each TPC-H query stresses the database in different ways. In case of Q3 one of the challenges is to deal with a very large, sorted result set of which only the top 10 rows need to be returned, and in Q19 the main challenge is to handle the OR clauses in its filter. CitusDB handles these cases efficiently in its query plan optimizer.
SparkSQL is slower than CitusDB in all cases and fails on some on the bigger join queries. The behaviour we observed is that it evicts some of the intermediate results that it needs later on from its cache, and then tries to re-compute them, but loses other results in doing so, and this happens repeatedly until the query fails. It should be noted that SparkSQL, the successor to Shark, is still in an early stage of development.
It is often not possible to fit all the data in memory, even with more memory available at lower costs. When the dataset does not fit in memory, it will be read from a hard drive or SSD while the query is running. Intermediate data used to execute the query may also not fit in memory. Columnar storage is well-suited for this use-case, since it can dramatically reduce the amount of data that has to be read from disk by only reading relevant columns. For CitusDB, we use our own cstore_fdw extension for PostgreSQL to store tables on disk.
For this benchmark we used 1 r3.4xlarge as a master, and 4 i2.4xlarge EC2 instances, which have similar specs to the r3.4xlarge, but have more and bigger drives:
We again ran each query 4 times and show the best-case (thick bar) and worst-case (thin bar) query times below.
When data does not fit in memory, CitusDB with cstore_fdw is again fastest in 8 out of 11 queries. The remaining queries (Q1, Q6, Q12) perform several aggregate functions like sum and avg over the biggest table (lineitem). The reason that those are relatively slower is that PostgreSQL reads rows one by one, incrementally updating the function result. Vectorized execution pulls in the data for many rows at once and computes the aggregate for them in one go. We've recently prototyped vectorized execution for PostgreSQL, which makes aggregate functions 3-4x faster, but did not use those changes in these benchmarks.
PostgreSQL is generally more efficient at reading data from disk because it can use the local file-system directly and is optimized for doing so. Impala and SparkSQL read data from HDFS, which handles sharding and replication for them, but comes with a performance penalty. CitusDB implements its own replication and sharding scheme, in which each shard or replica is stored as a separate table on a worker node.
Unfortunately, SparkSQL failed to run several of the join queries. We've also seen cases where a SparkSQL cluster would stop working after a few queries, possibly due to the way it caches intermediate results.
CitusDB and cstore_fdw make PostgreSQL scale to many machines and large datasets on commodity hardware. On a standard benchmark, its performance is already comparable to the fastest Hadoop-based solutions, and often much better across a variety of on-disk and in-memory workloads. Moreover, PostgreSQL is a mature database, which gives you a lot of other possibilities to improve performance (e.g., user-defined functions, indexes), lets you use your own extensions and custom queryable data types (e.g., JSON), and is generally very reliable, making it a great complement to any horizontally scalable infrastructure.
Stay tuned, as we will also be releasing an exciting addition to scalable PostgreSQL in the upcoming months!
One of the ideas we wanted to explore more has been speeding up in-memory aggregations in PostgreSQL through vectorized execution. The opportunity to do so came up when we had our intern, Can, take this on as a project during his summer internship. The early numbers he has there are promising - suggesting a 3-4x increase in PostgreSQL performance for simple SELECT queries with sum/count/group by operations.
This is proof-of-concept work conducted within several weeks and not with the production-ready diligence that we always have on our projects - hence we made a very explicit point to add "_test" to the end of the project name. That said, it shows good promise for further increases in performance, and given the ideas there can be useful more broadly, we are happy to open source the project.
Please take a look at postgres_vectorization_test on GitHub, and let us know what you think! The readme also has more details on our approach, sample queries, performance comparisons, and instructions on getting it set up as a PostgreSQL extension: https://github.com/citusdata/postgres_vectorization_test
With this, my special thanks go to our intern, Can, for tackling this in the short amount of time he had, to Metin for mentoring Can, and of course, to Ozgun, for pulling everything together for an exciting summer project.