Run SQL on MongoDB. Wait, say what?

MongoDB and PostgreSQL are two of the most popular databases in the open source world. One neat idea that recently became possible is keeping all your data in MongoDB while still running the full spectrum of SQL queries that PostgreSQL supports. This way, you can run complex analytic queries, join different collections, and benefit from SQL data visualization tools.

The feature that makes all this possible is foreign data wrappers. These wrappers enable running SQL on an external data source, provided that the wrapper knows how to convert the external data from its original form into PostgreSQL's tuple format. Example wrappers already exist for CSV files, MySQL, and the Twitter API.
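To make the conversion idea concrete, here is a toy Python sketch (the real wrapper API is a C interface inside PostgreSQL, and the function and sample data below are made up for illustration) of the core job a wrapper performs: scanning an external source and turning each record into a tuple ordered by the table's declared columns.

```python
import csv
import io

# Toy illustration of what a foreign data wrapper does: read records
# from an external source (here, CSV text) and convert each one into
# a tuple whose fields line up with the table's declared columns.
def scan_foreign_source(csv_text, columns):
    """Yield one tuple per external record, ordered by `columns`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for record in reader:
        yield tuple(record.get(col) for col in columns)

external_data = "customer_id,review_rating\nA2SWPY9CHUZCJM,5\n"
rows = list(scan_foreign_source(external_data, ["customer_id", "review_rating"]))
# rows == [("A2SWPY9CHUZCJM", "5")]
```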

We wanted to enable this functionality for MongoDB, and wrote an efficient wrapper named mongo_fdw. In the following, we describe this wrapper's use with an example setup that employs customer reviews data from Amazon. For brevity, we assume that PostgreSQL and mongo_fdw are already installed on the system. To download these packages and the example dataset from S3, you can refer to our documentation page; alternatively, you can get the source code for PostgreSQL 9.1 and mongo_fdw from GitHub and compile them yourself.

Now, let's start by connecting to the MongoDB server through the database shell and issuing a query to see an example document from the customer_reviews collection. Overall, we have 1.75M similar documents in this collection.

> printjson(db.customer_reviews.findOne())
{
    "_id" : ObjectId("506d7bc18cd5e1d0385f36a5"),
    "customer_id" : "A2SWPY9CHUZCJM",
    "review_date" : ISODate("1998-01-03T00:00:00Z"),
    "review_rating" : 5,
    "review_votes" : 0,
    "review_helpful_votes" : 0,
    "product_id" : "0553377876",
    "product_title" : "Half Asleep in Frog Pajamas",
    "product_group" : "Book",
    "product_category" : "Literature & Fiction",
    "product_subcategory" : "General",
    "similar_product_ids" : [ "055337933X", "0553377884", "0553803328" ]
}

Then, let's connect to the PostgreSQL database, and create a foreign table for this collection.

localhost# /opt/citusdb/1.0/bin/psql -h localhost -d postgres

postgres=# CREATE EXTENSION mongo_fdw;
postgres=# CREATE SERVER mongo_server FOREIGN DATA WRAPPER mongo_fdw
           OPTIONS (address '127.0.0.1', port '27017');

postgres=# CREATE FOREIGN TABLE customer_reviews
(
    customer_id TEXT,
    review_date TIMESTAMP,
    review_rating INTEGER,
    product_id CHAR(10),
    product_title TEXT,
    product_group TEXT,
    product_category TEXT,
    similar_product_ids CHAR(10)[]
)
SERVER mongo_server
OPTIONS (database 'test', collection 'customer_reviews');

These commands create a new server and foreign table, and associate that foreign table with the customer_reviews collection in MongoDB. Here, we also spell out the default option values in the OPTIONS clauses; in practice, you only need to pass these options if your values differ from the defaults.

After creating this foreign table, we are ready to run SQL queries against it. In the following, we show a rather simple query, but complex queries that involve sub-selects, SQL window functions, and joins across collections are also possible. In fact, you can even join a MongoDB collection with a PostgreSQL table.
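Conceptually, once the wrapper has converted documents into tuples, joining a MongoDB collection with a PostgreSQL table works like any other join. A toy Python sketch of the idea, using made-up sample rows (not from the actual dataset):

```python
# Sketch of joining "foreign" MongoDB documents with rows from a local
# PostgreSQL table. After the wrapper converts documents to tuples,
# the executor can hash-join them like any other relation.
mongo_reviews = [
    {"product_id": "0553377876", "review_rating": 5},
    {"product_id": "055337933X", "review_rating": 3},
]
postgres_products = [
    ("0553377876", "Half Asleep in Frog Pajamas"),
]

# Build a hash table on the join key from the local table ...
titles = {product_id: title for product_id, title in postgres_products}
# ... then probe it with each foreign tuple.
joined = [
    (review["product_id"], titles[review["product_id"]], review["review_rating"])
    for review in mongo_reviews
    if review["product_id"] in titles
]
# joined == [("0553377876", "Half Asleep in Frog Pajamas", 5)]
```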

SELECT
    round(avg(review_rating), 2),
    width_bucket(length(product_title), 1, 50, 5) as title_length,
    count(*)
FROM
    customer_reviews
WHERE
    product_group='Book' AND
    review_date >= '1998-01-01' AND
    review_date < date '1998-01-01' + interval '1 year'
GROUP BY
    title_length
ORDER BY
    title_length;
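The width_bucket call above groups reviews by product title length into five equal-width buckets spanning 1 to 50 characters. Its documented behavior can be sketched in Python (this helper is just an illustration, not part of mongo_fdw):

```python
import math

def width_bucket(operand, low, high, count):
    """Mimic PostgreSQL's width_bucket(): split [low, high) into
    `count` equal-width buckets; out-of-range values fall into
    bucket 0 or bucket count + 1."""
    if operand < low:
        return 0
    if operand >= high:
        return count + 1
    return int(math.floor(count * (operand - low) / (high - low))) + 1

# Integer title lengths of 1-10 characters land in bucket 1, 11-20 in
# bucket 2, and so on; this 27-character title falls into bucket 3.
width_bucket(len("Half Asleep in Frog Pajamas"), 1, 50, 5)
```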

Behind the scenes, mongo_fdw first takes all filtering expressions in the WHERE clause and converts them into their equivalents in a MongoDB query. The wrapper then sends this query to MongoDB, fetches all documents that match the query criteria, and converts these documents into PostgreSQL's internal tuple format. The PostgreSQL query executor takes care of the rest. You can see the translated query in the MongoDB server's log:

[conn2] query test.customer_reviews query: { product_group: "Book", review_date: 
{ $gte: new Date(883612800000), $lt: new Date(915148800000) } }
[conn2] getmore test.customer_reviews query: { product_group: "Book", review_date: 
{ $gte: new Date(883612800000), $lt: new Date(915148800000) } } 
cursorid:8599789499140355936
...
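The translation step that produces this MongoDB query can be sketched in Python. The real wrapper does this in C over PostgreSQL expression trees; the function below is a simplified stand-in that maps column/constant comparisons onto MongoDB's query operators:

```python
# Illustrative sketch of mongo_fdw's WHERE-clause pushdown: translate
# simple column/constant comparisons into a MongoDB query document.
SQL_TO_MONGO_OPS = {">=": "$gte", ">": "$gt", "<=": "$lte", "<": "$lt"}

def predicates_to_mongo_query(predicates):
    query = {}
    for column, op, value in predicates:
        if op == "=":
            query[column] = value  # equality is expressed directly
        else:
            query.setdefault(column, {})[SQL_TO_MONGO_OPS[op]] = value
    return query

# The WHERE clause from the example query above:
mongo_query = predicates_to_mongo_query([
    ("product_group", "=", "Book"),
    ("review_date", ">=", "1998-01-01"),
    ("review_date", "<", "1999-01-01"),
])
# mongo_query == {"product_group": "Book",
#                 "review_date": {"$gte": "1998-01-01", "$lt": "1999-01-01"}}
```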

This particular SQL query completes in about 3 seconds. We have room for improvement here, particularly if we start directly reading MongoDB data files from disk. Further, we have several ideas around reaching into nested BSON objects and allowing upper-case letters in column names.

And the really exciting part for us is that CitusDB can easily enable analytic queries for sharded MongoDB installations, as it is built on PostgreSQL and can already distribute SQL queries. We have a few projects in the works around this idea and more, so please get in touch with us at engage @ citusdata.com to tell us what you think!

Got comments? Join the discussion on Hacker News.

About

CitusDB is a scalable analytics database that's built on top of PostgreSQL.

In this blog, we share our ideas and experiences on databases and distributed systems.