One of the neat features introduced in PostgreSQL 9.1 is foreign tables. This feature allows users to declare a foreign table for an external data source, and run queries on that table using SQL. The key component that enables this is a foreign data wrapper written for the external data source; this wrapper knows how to access the foreign data and convert this data from its original form into PostgreSQL's internal tuple format.
Given such a wrapper, users can run SQL queries on a diverse set of data; some example wrappers exist for MySQL, CSV files, and Twitter.
In fact, PostgreSQL already comes with a file_fdw module that lets users run SQL on flat files without having to load the data into the database. Naturally, this flexibility comes with a performance penalty; these files now need to be parsed from scratch every time a query is run.
We wanted to quantify this performance penalty and measure its impact on query execution times. For this, we used the industry standard analytics benchmark, TPC-H, and generated a dataset of size 10 GB. We then ran two sets of experiments. In one, we compared PostgreSQL regular tables to foreign tables. In the other, we used a CitusDB cluster of one worker node, and again compared the performance of regular and foreign tables. The only difference between the two experiments was that CitusDB partitioned both the data and the queries into smaller pieces; and could therefore make use of all CPU cores available on the machine.
In all test scenarios, we used an m2.2xlarge instance from EC2. This instance has 4 CPU cores and 34 GB of RAM; we made sure that all TPC-H data fit into memory so that we became CPU bound. We also made five slight changes to the default settings. First, all TPC-H data files used the delimiter character '|' as their column separator, so we specified that as an option when creating foreign tables.
Second, we partitioned lineitem and orders tables into 4 blocks in CitusDB to make use of all available CPU cores. Third, we ran ANALYZE on regular tables to have the database collect data distribution statistics. Fourth, the file_fdw module didn't support indexes, so we chose not to create indexes for regular tables either. Fifth, we increased the following configuration values to inform the database about the additional memory available to it.
shared_buffers = 2GB work_mem = 64MB
We next ran all TPC-H queries supported by CitusDB once to warm up the cache. Then, we ran TPC-H queries in sequence three times, and averaged these execution times.
|TPC-H Queries||PostgreSQL||Citus DB|
These performance numbers help us make some interesting observations. First, queries that are already compute heavy, such as query #1, only slow down by 2x. This is because the CPU overhead of parsing foreign data is dwarfed by the CPU cost of applying over a dozen aggregate functions for the query. However, when the query isn't compute heavy, such as in query #6, the computation overhead of parsing the data becomes much more visible. Query #6 slows down by over 10x.
A second interesting observation relates to the absolute CPU cost of parsing foreign data. Both query #1 and #6 parse and deserialize the entire lineitem table in a little over 200 seconds. This table occupies 7.2 GB in size, and thus one CPU core parses around 36 MB/s. If you are reading the data from memory, this doesn't mean much. However, if you are reading data from a hard disk, such as the one available on this instance, the parsing overhead will become negligible with two CPU cores parsing the data in parallel.
Third, database statistics become relevant when the query involves joins between several tables. In version 9.1, the database doesn't collect statistics for foreign tables, and therefore chooses two different execution plans for query #12. As a result, this query takes about 10 minutes to complete with foreign tables on PostgreSQL. (On CitusDB, the results are better as lineitem and orders tables are partitioned into smaller blocks.) Fortunately, the new version of PostgreSQL introduces functionality to gather statistics on foreign tables, and should generate much better plans.
In summary, our experiments show that parsing the foreign data slows down query execution times by 2x to 18x. Naturally, file_fdw can be improved further to cache the parsed data, deserialize only the needed columns, or create indexes in the background. In fact, the PostgresRaw implementation shows that by creating adaptive indexes on raw data, query execution performance can even surpass that of regular tables.
Still, the primary benefit of foreign tables is their flexibility. With them, you can issue complex SQL queries on log files, JSON data, or MySQL. In fact, you can even join data from different sources without having to load any of it. And with CitusDB, you can scale out your analytics and start querying terabytes of data immediately.
All of this really excites us. We have a couple of projects in the works for scalable data analytics on foreign data sources, so please get in touch with us at engage @ citusdata.com to learn more!
Got comments? Join the discussion on Hacker News.