OverviewArchitectureSingle Node ClusterExamplesAmazon Reviews(create indexes)TPCH Benchmark(perform joins)Foreign DataFile WrapperMongoDB WrapperSQL on HadoopGoogle DremelFAQPseudo-distributedDistributedMultiple Node ClusterLinux NodesEC2 NodesPerformance BoostFeatures Not in v2.0Before Production |
SQL on Hadoop (Beta)We describe here a setup that runs PostgreSQL instances on Hadoop's data nodes. Each database instance accesses data in colocated HDFS blocks through PostgreSQL foreign data wrappers, and a coordinator node distributes and executes incoming SQL queries. This way, users can immediately run real-time SQL queries on their Hadoop clusters without having to load any data into a database. In this section, we first give an overview of PostgreSQL foreign data wrappers and their use in a distributed setting. We then talk about our approach in the context of a new generation of analytics databases inspired by Google’s Dremel. Next, we describe the installation steps needed to get CitusDB up and running on a Hadoop cluster. OverviewHadoop File System (HDFS) is a distributed file system that partitions its data into blocks and then replicates these blocks across a number of nodes. Metadata on these blocks are kept in the HDFS NameNode, and the actual block data are stored on a cluster of DataNodes. One key property of Hadoop involves colocating Map tasks with block data, and therefore avoiding network transfers between DataNodes. CitusDB extends on this idea by running a PostgreSQL instance on each DataNode. No data needs to be loaded into Postgres; users simply write a foreign data wrapper that knows how to convert block data into PostgreSQL's tuple format. In a sense, these wrappers are similar to Hive's SerDes. The user then declares a table schema on the CitusDB master node, and associates this table with an HDFS directory path. Then, they run a Java application that synchronizes block metadata from the HDFS NameNode to the CitusDB master node. This application also creates a colocated foreign table for each new block in HDFS, and optionally collects statistics about the underlying blocks. Since this application only synchronizes new block metadata, and creates foreign tables that are effectively views on block data, it executes fast. Still, it is worth noting that we envision this application to evolve over time. We could extend the application to pick up metadata changes from the HDFS NameNode's transaction logs in real-time, or provide integration with the metastore in Hive. Once block metadata appear on the CitusDB master, users can start issuing real-time SQL queries against their data. The master node replans each incoming query for distributed execution, partitions it into smaller queries, and pushes these queries to PostgreSQL instances "holding" the relevant blocks. The master node also bypasses Map/Reduce entirely, and therefore removes any latency associated with running batch jobs. At a high level, the architecture looks like the following.
Several important benefits that stem from this approach are:
Google DremelGoogle Dremel and Apache Hive have popularized several ideas that we think now define a new generation in analytics databases. These ideas include supporting semi-structured data, defining schemas at read time, and colocating queries with nodes that have the data. We think these ideas define a new generation in analytics databases, and built CitusDB from the ground up with them in mind. Citus pushes queries to the related data nodes, and uses PostgreSQL foreign data wrappers to read data in any format. Further, CitusDB enables much more. Existing Hadoop-based systems start with a file system abstraction and intend to build everything that relates to a distributed database from scratch. CitusDB on the other hand builds on top of PostgreSQL, and heavily leverages all optimizations and features that come with a database. These out of the box benefits include:
Thanks to all these features that we inherit from Postgres, the CitusDB master node can solely focus on planning SQL queries for fast parallel execution. For each incoming query, the master node creates a distributed query plan that's commutative and distributive, and then breaks the query into one per relevant data block. The master node then pushes these queries to the data nodes and handles automatic recovery of failures. Queries that can be efficiently parallelized by the CitusDB master include filters, groupings, sorts, limits, aggregations, most table joins, and certain subselects. We are actively working on improving our SQL functionality; and features that we don't yet support can be found here. Frequently Asked QuestionsHow does CitusDB's performance compare against Apache Hive? The precise amount of performance gains in CitusDB depends on the underlying data’s format and the nature of the SQL queries. The following lists our initial findings; and we are working on producing fair and repeatable benchmark results to share with the community in the upcoming months.
Does CitusDB recover from failures? Yes. The CitusDB master node intelligently re-routes the work on any failed nodes to the remaining nodes in real-time. Since the underlying data are kept in fixed-size blocks in HDFS, a failed node's work can evenly be distributed among the remaining nodes in the cluster. What about CitusDB master node failures?
CitusDB handles master node failures through PostgreSQL's streaming replication feature. Users set up streaming replicas of the master node, and edit configuration to fail over to one of these replicas in case the master node fails. Further, users can issue read-only SELECT queries against these streaming replicas, but write queries that edit metadata can only go through the active master. As a side note, any CitusDB node can "act" as a master as long as it has access to table and block metadata. We currently don't cache this metadata on the worker nodes though, in order to avoid potential data consistency issues. How different is CitusDB from stock PostgreSQL?
CitusDB isn’t a fork of PostgreSQL; it simply extends Postgres to support distributed SQL queries. Also, CitusDB version numbers are aligned with major version upgrades in Postgres; CitusDB v1.0 is based on PostgreSQL 9.1, and v2.0 is based on PostgreSQL 9.2. In terms of distributed functionality, CitusDB in fact doesn't make a distinction between master and worker nodes. A database becomes the master node simply when the user creates a distributed table on it and uploads block metadata. This master node then executes built in logic to plan and execute distributed SQL queries. On the worker nodes, such logic doesn't need to exist, and one can use stock PostgreSQL there if:
What are CitusDB's technical limitations? As previously mentioned, CitusDB doesn't yet support the entire spectrum of SQL queries. Features that are missing from the current release are outlined in here; and we found that many of these limitations can be overcome by modeling the data differently. Still, we are removing all restrictions for distinct clauses and large table joins, and intend to fully support them in v2.1. Further, CitusDB only supports file formats that have foreign data wrappers implemented for them. This at the moment only includes text data in tabular format, but we are working on implementing foreign wrappers for many other data formats. Finally, we associate one HDFS block with one foreign table, and execute the entire SQL query locally on that block. If bytes for the last record(s) in one HDFS block spill over to the next one, we currently don't fetch those bytes and instead skip the last record. This again is a limitation we intend to remove in v2.1. What other data formats are in the works?
We are currently implementing foreign data wrappers for JSON, Avro, and Sequence files. We also intend to support common compression formats such as Snappy and Gzip with these foreign wrappers. The nice thing about PostgreSQL's wrappers is that their APIs are publicly available; and one can immediately start benchmarking SQL queries against an HDFS block (local file) after writing the wrapper. What about running SQL on other distributed databases such as MongoDB and HBase? CitusDB already provides SQL support for MongoDB through distributed foreign tables. For HBase, we are considering adding support soon. In practice, we find that supporting SQL on any distributed database requires the following:
Pseudo-Distributed ClusterAfter talking about our architecture's technical details, we describe here setting up a single node Hadoop cluster operating in pseudo-distributed mode. We then set up CitusDB databases on this node, and cover all the steps involved in querying the data that's in Hadoop. To simplify installation instructions, we use an EC2 AMI that includes Hadoop 1.1, CitusDB 2.0, and our metadata syncing application. You can alternatively use the Hadoop and CitusDB setup guides to download these packages and install from there.
localhost# ssh -i <private SSH key file> ec2-user@<ec2 nodename> ec2-node# wget http://examples.citusdata.com/customer_reviews_1998.csv.gz ec2-node# wget http://examples.citusdata.com/customer_reviews_1999.csv.gz ec2-node# gzip -d customer_reviews_1998.csv.gz ec2-node# gzip -d customer_reviews_1999.csv.gz ec2-node# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa ec2-node# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys ec2-node# cd hadoop-1.1.1 ec2-node# bin/hadoop namenode -format ec2-node# bin/start-all.sh ec2-node# bin/hadoop fs -mkdir /user/data/reviews ec2-node# bin/hadoop fs -put ~/customer_reviews_1998.csv /user/data/reviews ec2-node# bin/hadoop fs -put ~/customer_reviews_1999.csv /user/data/reviews ec2-node# /opt/citusdb/2.0/bin/initdb -D /opt/citusdb/2.0/data.9700 ec2-node# emacs -nw /opt/citusdb/2.0/data/pg_worker_list.conf # HOSTNAME [PORT] [RACK] localhost. 9700
localhost# /opt/citusdb/2.0/bin/pg_ctl -D /opt/citusdb/2.0/data -l logfile start
localhost# /opt/citusdb/2.0/bin/pg_ctl -D /opt/citusdb/2.0/data.9700 -o "-p 9700" \
-l logfile.9700 start
ec2-node# /opt/citusdb/2.0/bin/psql -h localhost -p 5432 -d postgres
postgres# CREATE EXTENSION file_fdw;
postgres# CREATE SERVER file_server FOREIGN DATA WRAPPER file_fdw;
postgres# CREATE FOREIGN TABLE customer_reviews
(
customer_id TEXT not null,
review_date DATE not null,
review_rating INTEGER not null,
review_votes INTEGER,
review_helpful_votes INTEGER,
product_id CHAR(10) not null,
product_title TEXT not null,
product_sales_rank BIGINT,
product_group TEXT,
product_category TEXT,
product_subcategory TEXT,
similar_product_ids CHAR(10)[]
)
DISTRIBUTE BY APPEND (review_date)
SERVER file_server
OPTIONS (filename '', hdfs_directory_path '/user/data/reviews', format 'csv');
ec2-node# cd ~/hadoop-sync ec2-node# java -jar target/hadoop-sync-0.1.jar customer_reviews --fetch-min-max
-- Find all reviews a particular customer made on the Dune series in 1998.
SELECT
customer_id, review_date, review_rating, product_id, product_title
FROM
customer_reviews
WHERE
customer_id ='A27T7HVDXA3K2A' AND
product_title LIKE '%Dune%' AND
review_date >= '1998-01-01' AND
review_date <= '1998-12-31';
-- Do we have a correlation between a book's title's length and its review ratings?
SELECT
width_bucket(length(product_title), 1, 50, 5) title_length_bucket,
round(avg(review_rating), 2) AS review_average,
count(*)
FROM
customer_reviews
WHERE
product_group = 'Book'
GROUP BY
title_length_bucket
ORDER BY
title_length_bucket;
Distributed ClusterWe now describe the steps involved in setting up CitusDB on a multiple node Hadoop cluster. To reduce the number of steps involved, we use an EC2 machine image that includes Hadoop 1.1, CitusDB 2.0, and our metadata syncing application. Alternatively, you can use the Hadoop and CitusDB setup guides to download these packages and install from there. For our example cluster, we assume a four node setup. One of these nodes act as the HDFS NameNode, and the others act as the DataNodes. We use the terminology node-name-100 to refer to the NameNode's hostname; and for simplicity assume that the CitusDB master node lives on name-node-100 in our setting. In production deployments however, we advise against running the NameNode and the CitusDB master on the same machine.
localhost# ssh -i <private SSH key file> ec2-user@<name-node-100> name-node-100# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa name-node-100# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys name-node-100# ssh ec2-user@ip-data-node-101.ec2.internal name-node-100# ssh ec2-user@ip-data-node-102.ec2.internal name-node-100# ssh ec2-user@ip-data-node-103.ec2.internal
all-nodes# emacs -nw ~/hadoop-1.1.1/conf/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://ip-namenode-name-goes-here.ec2.internal:9000</value>
</property>
</configuration>
all-nodes# emacs -nw ~/hadoop-1.1.1/conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/opt/citusdb/hdfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/opt/citusdb/hdfs/data</value>
</property>
<property>
<name>dfs.block.local-path-access.user</name>
<value>ec2-user</value>
</property>
</configuration>
name-node-100# emacs -nw ~/hadoop-1.1.1/conf/slaves ip-data-node-100-goes-here.ec2.internal ip-data-node-101-goes-here.ec2.internal ip-data-node-102-goes-here.ec2.internal name-node-100# wget http://examples.citusdata.com/customer_reviews_1998.csv.gz name-node-100# wget http://examples.citusdata.com/customer_reviews_1999.csv.gz name-node-100# gzip -d customer_reviews_1998.csv.gz name-node-100# gzip -d customer_reviews_1999.csv.gz name-node-100# cd hadoop-1.1.1 name-node-100# bin/hadoop namenode -format name-node-100# bin/start-dfs.sh name-node-100# bin/hadoop fs -mkdir /user/data/reviews name-node-100# bin/hadoop fs -put ~/customer_reviews_1998.csv /user/data/reviews name-node-100# bin/hadoop fs -put ~/customer_reviews_1999.csv /user/data/reviews name-node-100# emacs -nw /opt/citusdb/2.0/data/pg_worker_list.conf # HOSTNAME [PORT] [RACK] ip-data-node-100-goes-here.ec2.internal ip-data-node-101-goes-here.ec2.internal ip-data-node-102-goes-here.ec2.internal all-nodes# /opt/citusdb/2.0/bin/pg_ctl -D /opt/citusdb/2.0/data -l logfile start
name-node-100# /opt/citusdb/2.0/bin/psql -h localhost -d postgres
postgres# CREATE EXTENSION file_fdw;
postgres# CREATE SERVER file_server FOREIGN DATA WRAPPER file_fdw;
postgres# CREATE FOREIGN TABLE customer_reviews
(
customer_id TEXT not null,
review_date DATE not null,
review_rating INTEGER not null,
review_votes INTEGER,
review_helpful_votes INTEGER,
product_id CHAR(10) not null,
product_title TEXT not null,
product_sales_rank BIGINT,
product_group TEXT,
product_category TEXT,
product_subcategory TEXT,
similar_product_ids CHAR(10)[]
)
DISTRIBUTE BY APPEND (review_date)
SERVER file_server
OPTIONS (filename '', hdfs_directory_path '/user/data/reviews', format 'csv');
name-node-100# cd ~/hadoop-sync name-node-100# cp target/classes/sync.properties . name-node-100# emacs -nw sync.properties # HDFS related cluster configuration settings HdfsMasterNodeName = ip-namenode-name-goes-here.ec2.internal HdfsMasterNodePort = 9000 HdfsWorkerNodePort = 50020 # CitusDB related cluster configuration settings CitusMasterNodeName = localhost CitusMasterNodePort = 5432 CitusWorkerNodePort = 5432 ec2-node# java -jar target/hadoop-sync-0.1.jar customer_reviews --fetch-min-max
-- Find all reviews a particular customer made on the Dune series in 1998.
SELECT
customer_id, review_date, review_rating, product_id, product_title
FROM
customer_reviews
WHERE
customer_id ='A27T7HVDXA3K2A' AND
product_title LIKE '%Dune%' AND
review_date >= '1998-01-01' AND
review_date <= '1998-12-31';
-- Do we have a correlation between a book's title's length and its review ratings?
SELECT
width_bucket(length(product_title), 1, 50, 5) title_length_bucket,
round(avg(review_rating), 2) AS review_average,
count(*)
FROM
customer_reviews
WHERE
product_group = 'Book'
GROUP BY
title_length_bucket
ORDER BY
title_length_bucket;
|