“Does PostgreSQL support compression?” is a common question we get from our customers, and it’s easy to see why they’re asking. Many of them generate and collect large volumes of log and event stream data and store it in text formats such as JSON or CSV. Typically these files are then compressed using a codec like gzip, and compression ratios approaching 3x-4x are not uncommon. The problem is that you sacrifice these storage savings as soon as you load the data into a database that doesn’t support compression.
Unfortunately, the answer to this question is neither a clear cut “yes” or “no”. If certain criteria are satisfied PostgreSQL’s storage layer will transparently compress variable length types using a mechanism known as The Oversized-Attribute Storage Technique, or TOAST. The problem is that TOAST’s compression won’t kick in unless the size of a row exceeds the TOAST_TUPLE_THRESHOLD value (which is set to 2KB on most systems), and even when this happens TOAST will only compress the variable length types contained in the row. TOAST does a good job of satisfying its intended purpose, which is to avoid the overhead and complexity of row-chaining by compressing a row to make it fit within one of PostgreSQL’s 8KB buffer cache pages, but it’s a poor substitute for a general purpose compression layer.
In the absence of general purpose compression support in PostgreSQL we began to wonder if we could achieve some of the same benefits by leveraging compression at the filesystem layer. In the past decade several file systems have added support for transparent compression. Currently the most mature option is ZFS, an open-source filesystem originally designed at Sun Microsystems. In addition to transparent, in-line compression with a choice of several codecs (gzip, LZJB, and LZ4), ZFS also supports a bunch of other powerful features including software RAID, virtual storage pools, snapshots, deduplication and encryption. ZFS’s biggest limitation is its CDDL license which prevents it from being distributed with the Linux kernel. Recently, this situation has improved significantly as a result of the ZFS on Linux Project which provides ZFS packages for many of the major Linux distributions.
In order to verify our hunch that ZFS is a viable substitute for native compression support in the database, we decided to test the performance and compression characteristics of CitusDB running on top of ZFS and compare this to the default configuration that uses ext3. Further, we focused on a workload that was completely bound by disk I/O in our experiments.
We ran the tests on EC2 using two c1.xlarge instances initialized with Ubuntu 12.04 LTS. We installed CitusDB 2.0 on both nodes and then arranged the nodes in a 1:1 master/worker configuration. For sample data we used 38GB of event logs from the GithubArchive dataset. We converted the sample data from JSON to CSV format using a script similar to one provided in the GithubArchive project, and then loaded the results into CitusDB using the \STAGE command and this schema. After the load operation finished we calculated the compression ratio by measuring the size of the data directory on the worker node.
The results indicate that ZFS achieved a 2x compression ratio using the default LZJB compression codec, and close to a 4x compression ratio using the gzip codec.
Before we get carried away with these results we need to quantify the impact compression has on performance. Running ZFS with compression enabled forces the file system do additional work on top of every disk read and write operation. It’s reasonable to assume that this will have a negative impact on overall performance, but the question we really need to answer is whether the performance impact is so significant that it outweighs the benefits. In order to do this we devised a simple benchmark composed of seven queries, which we then executed against the 38GB dataset we loaded in the previous step. It’s important to note that the CitusDB worker nodes will process the shards in parallel, which implies that we are also testing the ability of the file system or kernel’s IO scheduler to handle concurrent read requests.
These results are surprising to say the least. We know that enabling compression forces the CPU to do additional work on top of every read and write operation, so how is it possible for Query 1 to run nearly four times faster when the underlying data is compressed with gzip? The explanation for this counterintuitive result hinges on the fact that these queries are IO bound. In other words, the primary bottleneck limiting the performance of these queries is not the speed of the CPU but rather the rate at which data can be read from the disk into the kernel’s buffer cache. With compression enabled these operations take less time because less data needs to travel between the disk and the buffer cache. In addition, seek times are lower because compression reduces the physical distances between logical blocks of data on the disk, and the disk buffer’s hit rate improves since it is able to store more data in the same amount of space (note that we’re talking about the hard drive’s buffer -- this last advantage does not apply to the buffer cache since ZFS decompresses blocks before saving them in the cache).
While doing background research for this blog post we also got a chance to investigate some of the other features besides compression that differentiate ZFS from older file system architectures like ext3. One of the biggest differences is ZFS’s approach to scheduling disk IOs which employs explicit IO priorities, IOP reordering, and deadline scheduling in order to avoid flooding the request queues of disk controllers with pending requests. We were curious to see how well these techniques work in practice. To do that we dropped the old githubarchive table and loaded a new copy, this time splitting the table into 23 shards instead of 2. The total amount of data that each query runs against remains unchanged, but CitusDB will now divide the work between 23 processes instead of 2.
The first thing we need to point out is that the horizontal axis on this chart is twice as large as the one in the previous section. The second thing worth noting is that this test hosed ext3. Queries which took on average 350 seconds to execute with two processes now take roughly twice as long when run with 23 processes. These results are disappointing but won’t surprise anyone who is familiar with the performance characteristics of sequential and random reads on conventional file systems. Disks can handle sequential reads much faster than random reads since the former do not require the disk to reposition the read head. In both test runs the individual CitusDB worker processes made sequential read requests, but this characteristic disappears and things start to look random if the IO scheduler simply enqueues IO requests in the order in which they are received. In contrast ZFS’s IO scheduler attempts to reorder IO operations in an effort to limit the amount of time the disk spends seeking from one block to the next, and the results we obtained here support the conclusion that ZFS’s approach actually works.
So what are the takeaways from this experiment? First and foremost we succeeded in proving our hypothesis that ZFS makes it possible to achieve 3-4x compression ratios for event stream and log datasets. Second, we demonstrated that ZFS and compression actually improves performance when queries are IO bound. Finally, we also determined that ZFS on Linux is maturing and will likely become a viable option for enterprise deployments in the future.
Carl wishes to thank Metin Doslu for contributing background research and results that are used in this article.
We are excited to bring together the performance of PostgreSQL and the scalability of Apache Hadoop, and enable real-time queries on data that's already in Hadoop. This new functionality becomes possible with CitusDB's powerful distributed query planner, and PostgreSQL's foreign data wrappers.
From a technical standpoint, one can view this functionality as removing Apache Hive's real-time query limitations, and compare CitusDB to a new class of analytics databases inspired by Google's Dremel. In this context, CitusDB shares many similarities with these systems; it runs queries natively without any data loads, and offers SQL access to data stored in different formats on Hadoop clusters.
What really excites us is that CitusDB enables all these features through a battle-tested database. While existing Hadoop-based systems intend to build everything that relates to a distributed database from scratch, Citus builds on top of PostgreSQL and brings many features and optimizations developed in the world of databases. Some of these include advanced disk I/O optimizations, wide set of data types and user-defined functions (UDFs), and enterprise class features such as client authentication and internationalization.
All of this becomes possible by colocating PostgreSQL databases on Hadoop nodes, and directly reading local Hadoop data through Postgres foreign data wrappers. For distributed query execution, users synchronize Hadoop metadata to a CitusDB coordinator node, and the coordinator then distributes and executes incoming SQL queries in real-time. The source code for a Beta release of our synchronization module is available on GitHub; and our documentation covers all the steps needed to get CitusDB running using our public packages or EC2 machine images.
In terms of its high level architecture, our approach looks like the following:
Several important benefits to using this approach are:
We cover further technical details in our documentation, and below answer some questions that frequently come up.
How does CitusDB's performance compare against Apache Hive?
The precise amount of performance gains in CitusDB depends on the underlying data’s format and the nature of the SQL queries. The following lists our initial findings; and we are working on producing fair and repeatable benchmark results to share with the community in the upcoming months.
How does CitusDB's feature set compare against other Hadoop-based systems?
CitusDB builds on top of PostgreSQL, and leverages all optimizations and features that come with a database. These out of the box benefits include:
Does CitusDB recover from failures?
Yes. The CitusDB master node intelligently re-routes the work on any failed nodes to the remaining nodes in real-time. Since the underlying data are kept in fixed-size blocks in HDFS, a failed node's work can evenly be distributed among the remaining nodes in the cluster.
What about CitusDB master node failures?
CitusDB handles master node failures through PostgreSQL's streaming replication feature. Users set up streaming replicas of the master node, and edit configuration to fail over to one of these replicas in case the master node fails. Further, users can issue read-only SELECT queries against these streaming replicas, but write queries that edit metadata can only go through the active master.
As a side note, any CitusDB node can "act" as a master as long as it has access to table and block metadata. We currently don't cache this metadata on the worker nodes though, in order to avoid potential data consistency issues.
What about running SQL on other distributed databases such as MongoDB and HBase?
CitusDB already provides SQL support for MongoDB through distributed foreign tables. For HBase, we are considering adding support soon. In practice, we find that supporting SQL on any distributed database requires the following:
What are CitusDB's technical limitations?
As previously mentioned, CitusDB doesn't yet support the entire spectrum of SQL queries. Features that are missing from the current release are outlined in here; and we found that many of these limitations can be overcome by modeling the data differently. Still, we are removing all restrictions for distinct clauses and large table joins, and intend to fully support them in v2.1.
Further, CitusDB only supports file formats that have foreign data wrappers implemented for them. This at the moment only includes text data in tabular format, but we are working on implementing foreign wrappers for many other data formats. Finally, we associate one HDFS block with one foreign table, and execute the entire SQL query locally on that block. If bytes for the last record(s) in one HDFS block spill over to the next one, we currently don't fetch those bytes and instead skip the last record. This again is a limitation we intend to remove in v2.1.
What other data formats are in the works?
We are currently implementing foreign data wrappers for JSON, Avro, and Sequence files. We also intend to support common compression formats such as Snappy and Gzip with these foreign wrappers. The nice thing about PostgreSQL's wrappers is that their APIs are publicly available; and one can immediately start benchmarking SQL queries against an HDFS block (local file) after writing the wrapper.