You can scale out your analytics and start querying terabytes of data immediately with CitusDB

Metin Doslu | 29th October, 2012

After our earlier post on running SQL on MongoDB, we received many questions about the performance implications of our approach. In that approach, we used a foreign data wrapper that converted BSON documents to their PostgreSQL equivalents at query time. This way, users who had their data in MongoDB could run complex analytic queries, join different collections, and benefit from SQL ecosystem tools -- all without having to load any data into PostgreSQL.
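To make the approach concrete, here is a minimal sketch of how a MongoDB collection gets exposed as a PostgreSQL foreign table through mongo_fdw. The server address, column list, and collection name below are illustrative, not the exact benchmark schema:

```sql
-- Assumes the mongo_fdw extension is installed; option names follow the
-- wrapper's address/port/database/collection options.
CREATE EXTENSION mongo_fdw;

CREATE SERVER mongo_server FOREIGN DATA WRAPPER mongo_fdw
    OPTIONS (address '127.0.0.1', port '27017');

-- Map a MongoDB collection to a foreign table; matching BSON fields are
-- converted to these column types at query time.
CREATE FOREIGN TABLE customer_reviews
    (customer_id TEXT, review_date DATE, review_rating INTEGER)
    SERVER mongo_server
    OPTIONS (database 'test', collection 'customer_reviews');

-- Regular SQL, including joins against local tables, now just works:
SELECT review_rating, count(*)
FROM customer_reviews
GROUP BY review_rating;
```

Because the conversion happens at query time, no data ever needs to be loaded into PostgreSQL, which is exactly where the overhead measured below comes from.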

Naturally, running SQL without loading any data comes with a performance overhead. To quantify this cost, we used the industry-standard analytics benchmark TPC-H and compared query run times on regular PostgreSQL tables against run times when processing BSON documents through mongo_fdw. In all our tests, we ran each query once to warm up the cache, and then averaged the results of three consecutive runs.

TPC-H Queries    PostgreSQL             CitusDB (1 worker)
 (in seconds)    Regular    mongo_fdw   Regular    mongo_fdw
Query 1          318.3      742.5       55.7       157.6
Query 3          41.5       275.4       12.4       80.6
Query 5          37.0       (2 hrs)     11.5       105.0
Query 6          21.1       72.4        5.5        21.2
Query 10         57.0       157.7       38.6       61.1
Query 12         34.5       326.6       15.0       54.5
Query 14         21.0       68.2        9.1        26.5
Query 19         28.9       147.4       10.1       96.3

A key point to note is that these are all analytics queries that take between 20 and 300 seconds to complete when all the data are stored in PostgreSQL. Using foreign data wrappers instead, the same queries complete in 70 to 700 seconds on a single CPU core. And with multiple cores (for instance, with CitusDB on one worker node), query performance improves further.

Now, a couple of notes about our test setup. We used TPC-H to generate 10 GB of raw data in tabular format. We realize MongoDB collections can capture more complex relations, but our goal here was to measure the performance overhead of converting documents using a well-established benchmark.

We also ran two sets of separate experiments. In one, we compared PostgreSQL regular table performance to reading and converting documents from MongoDB. In the other, we used a CitusDB cluster of one worker node, and again compared the performance of regular tables to mongo_fdw. The only difference between the two experiments was that we used a sharded MongoDB in the second test; that way CitusDB could partition SQL queries and make use of all CPU cores available on the machine.

In all test scenarios, we used an m2.2xlarge instance from EC2. This instance has 4 CPU cores and 34 GB of RAM; we made sure that all TPC-H data fit into memory so that the tests were CPU-bound. Further, in the second set of tests, we split the lineitem and orders collections into 4 shards to use all available CPU cores.

We didn't explicitly create any indexes on tables or collections, but we did run ANALYZE on the regular tables. In the PostgreSQL configuration, we increased shared_buffers to 2GB and work_mem to 64MB to inform the database about the additional memory available to it.
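For reference, the two settings we changed would look like this in postgresql.conf (a sketch; the values are taken from the description above, and the comments reflect why each one matters for this workload):

```
shared_buffers = 2GB    # buffer cache; lets more of the 10 GB dataset stay in shared memory
work_mem = 64MB         # per-operation sort/hash memory for TPC-H joins and aggregations
```

Both settings require a server restart or reload to take effect, so we applied them before loading any data.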

Given this setup, we have three observations:

  • Queries that are already compute heavy, such as query #1, slow down by only about 2x in relative terms. In absolute terms, however, this query takes the largest hit: mongo_fdw needs to read 95% of the documents from MongoDB and convert most of their fields into their column equivalents.
  • When the query isn't compute heavy, such as in query #6, the computation overhead of reading and converting documents becomes more visible. Query #6 slows down by 4.5x. Still, this query can go over 800K documents per second per CPU core, as most documents are filtered by the MongoDB server and only a few columns need to be deserialized.
  • Database statistics become relevant when the query joins several tables. In version 9.1, PostgreSQL doesn't collect statistics on foreign tables, and therefore chooses a very different (and much slower) execution plan for query #5 than it does for regular tables. As a result, this query doesn't complete in the PostgreSQL test with mongo_fdw. (On CitusDB, the results are better because the lineitem and orders collections are partitioned into smaller shards.) Fortunately, the new version of PostgreSQL, 9.2, solves this problem.
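The statistics fix in PostgreSQL 9.2 comes down to being able to ANALYZE foreign tables, so the planner has row counts and column statistics when ordering joins. A minimal sketch (the table name is illustrative, and this requires a wrapper that implements the analyze callback):

```sql
-- lineitem here is a foreign table backed by mongo_fdw; after this,
-- the planner can pick a sensible join order for queries like TPC-H #5.
ANALYZE lineitem;
```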

Naturally, many other optimizations exist to improve mongo_fdw's performance. Still, the primary benefit of foreign tables is their flexibility. With them, you can issue complex SQL queries on log files, JSON data, or MongoDB. In fact, you can even join data from different sources without having to load any of it.

And with CitusDB, you can scale out your analytics and start querying terabytes of data immediately. If this sounds interesting, please keep in touch by following us on Twitter @citusdata.