Merging the performance of PostgreSQL and the scalability of Apache Hadoop to enable real-time queries on data that's already in Hadoop

Ozgun Erdogan | 19th February, 2013

We are excited to bring together the performance of PostgreSQL and the scalability of Apache Hadoop, and enable real-time queries on data that's already in Hadoop. This new functionality becomes possible with CitusDB's powerful distributed query planner, and PostgreSQL's foreign data wrappers.

From a technical standpoint, one can view this functionality as removing Apache Hive's real-time query limitations, and compare CitusDB to a new class of analytics databases inspired by Google's Dremel. In this context, CitusDB shares many similarities with these systems; it runs queries natively without any data loads, and offers SQL access to data stored in different formats on Hadoop clusters.

What really excites us is that CitusDB enables all these features through a battle-tested database. While existing Hadoop-based systems intend to build everything that relates to a distributed database from scratch, Citus builds on top of PostgreSQL and brings many features and optimizations developed in the world of databases. Some of these include advanced disk I/O optimizations, wide set of data types and user-defined functions (UDFs), and enterprise class features such as client authentication and internationalization.

All of this becomes possible by colocating PostgreSQL databases on Hadoop nodes, and directly reading local Hadoop data through Postgres foreign data wrappers. For distributed query execution, users synchronize Hadoop metadata to a CitusDB coordinator node, and the coordinator then distributes and executes incoming SQL queries in real-time. The source code for a Beta release of our synchronization module is available on GitHub; and our documentation covers all the steps needed to get CitusDB running using our public packages or EC2 machine images.

In terms of its high level architecture, our approach looks like the following:


CitusDB Architecture


Several important benefits to using this approach are:

  • Bypassing MapReduce when accessing distributed data, thereby enabling real-time queries
  • Avoiding network bottlenecks by pushing queries to the data nodes, and running them locally there
  • Instantly running SQL queries on Hadoop clusters without having to load any data into a database
  • Leveraging decades of performance and feature work done on PostgreSQL (see questions below)

We cover further technical details in our documentation, and below answer some questions that frequently come up.

How does CitusDB's performance compare against Apache Hive?

The precise amount of performance gains in CitusDB depends on the underlying data’s format and the nature of the SQL queries. The following lists our initial findings; and we are working on producing fair and repeatable benchmark results to share with the community in the upcoming months.

  • CitusDB has notable latency advantages when executing short queries. For example, it can execute a simple query on Hadoop in as little as 100ms, but we don't think such queries make fair performance comparisons.
  • When we compare CitusDB against Apache Hive using the industry standard TPC-H benchmark, we see performance gains of 3-5x for data in text format. When the data is in binary format, these gains increase up to 4-20x.
  • Hive's query performance depends on the join order you specify. CitusDB doesn't have that restriction, and for fair comparisons, we are experimenting with different join orders for Hive QL queries.

How does CitusDB's feature set compare against other Hadoop-based systems?

CitusDB builds on top of PostgreSQL, and leverages all optimizations and features that come with a database. These out of the box benefits include:

Does CitusDB recover from failures?

Yes. The CitusDB master node intelligently re-routes the work on any failed nodes to the remaining nodes in real-time. Since the underlying data are kept in fixed-size blocks in HDFS, a failed node's work can evenly be distributed among the remaining nodes in the cluster.

What about CitusDB master node failures?

CitusDB handles master node failures through PostgreSQL's streaming replication feature. Users set up streaming replicas of the master node, and edit configuration to fail over to one of these replicas in case the master node fails. Further, users can issue read-only SELECT queries against these streaming replicas, but write queries that edit metadata can only go through the active master.

As a side note, any CitusDB node can "act" as a master as long as it has access to table and block metadata. We currently don't cache this metadata on the worker nodes though, in order to avoid potential data consistency issues.

What about running SQL on other distributed databases such as MongoDB and HBase?

CitusDB already provides SQL support for MongoDB through distributed foreign tables. For HBase, we are considering adding support soon. In practice, we find that supporting SQL on any distributed database requires the following:

  • A mechanism to synchronize shard metadata from the distributed database into the CitusDB master node
  • A foreign data wrapper designed to run locally on each worker node, and that can convert external shard data into PostgreSQL's tuple format.

What are CitusDB's technical limitations?

As previously mentioned, CitusDB doesn't yet support the entire spectrum of SQL queries. Features that are missing from the current release are outlined in here; and we found that many of these limitations can be overcome by modeling the data differently. Still, we are removing all restrictions for distinct clauses and large table joins, and intend to fully support them in v2.1.

Further, CitusDB only supports file formats that have foreign data wrappers implemented for them. This at the moment only includes text data in tabular format, but we are working on implementing foreign wrappers for many other data formats. Finally, we associate one HDFS block with one foreign table, and execute the entire SQL query locally on that block. If bytes for the last record(s) in one HDFS block spill over to the next one, we currently don't fetch those bytes and instead skip the last record. This again is a limitation we intend to remove in v2.1.

What other data formats are in the works?

We are currently implementing foreign data wrappers for JSON, Avro, and Sequence files. We also intend to support common compression formats such as Snappy and Gzip with these foreign wrappers. The nice thing about PostgreSQL's wrappers is that their APIs are publicly available; and one can immediately start benchmarking SQL queries against an HDFS block (local file) after writing the wrapper.