POSETTE 2024 is a wrap! 💯 Thanks for joining the fun! Missed it? Watch all 42 talks online 🍿
POSETTE 2024 is a wrap! 💯 Thanks for joining the fun! Missed it? Watch all 42 talks online 🍿
Written by Hadi Moshayedi
May 30, 2013
We keep our log data in JSON files. Can we run analytic queries on these files without having to load them into a database?
We hear this question frequently from customers, and are finally happy to say yes. Without loading any data into a database, you can now quickly analyze your semi-structured data, and even make use of standard SQL data visualization tools. The key component that makes this possible is foreign data wrappers in Postgres. These wrappers enable running SQL on an external data source, given that the wrapper knows how to convert external data from its original form into PostgreSQL's tuple format.
We wanted to enable this functionality for JSON files, and implemented a wrapper named jsonfdw. We made sure jsonfdw handled nested fields, array types, and heterogeneous documents, and then open sourced the project. The source code and build instructions for json_fdw are available through GitHub; and the wrapper depends on PostgreSQL 9.2 for its SQL capabilities.
Now, let's see how this all fits together with an example. For this, we start by downloading customer reviews data from Amazon for the year 1998 in nested JSON format.
$ wget http://examples.citusdata.com/customer_reviews_nested_1998.json.gz
We have about 600K documents in this dataset. Let's take a look at an example document in this file:
$ gunzip -c customer_reviews_nested_1998.json.gz | head -n 1
{ "customer_id": "AE22YDHSBFYIP", "review": { "date": "1970-12-30", "rating": 5,
"votes": 10, "helpful_votes": 0 }, "product": { "id": "1551803542", "title":
"Start and Run a Coffee Bar (Start & Run)", "sales_rank": 11611, "group": "Book",
"category": "Business & Investing", "subcategory": "General", "similar_ids":
["0471136174","0910627312","047112138X","0786883561","0201570483"] } }
Now, let's connect to the PostgreSQL database, and create a foreign table for this JSON file.
$ /opt/citusdb/2.0/bin/psql -h localhost -d postgres
postgres=# CREATE EXTENSION json_fdw;
postgres=# CREATE SERVER json_server FOREIGN DATA WRAPPER json_fdw;
postgres=# CREATE FOREIGN TABLE customer_reviews
(
customer_id TEXT,
"review.date" DATE,
"review.rating" INTEGER,
"product.id" CHAR(10),
"product.group" TEXT,
"product.title" TEXT,
"product.similar_ids" CHAR(10)[]
)
SERVER json_server
OPTIONS (filename '/home/citusdata/customer_reviews_nested_1998.json.gz');
The first two commands define json_fdw in the database. The last command declares a foreign table schema that gets interpreted at read-time. Some other points to highlight here:
Now that we have the foreign table created, we can run SQL queries on it. In the following, we show a rather simple query, but complex queries that involve sub-selects, SQL window functions, and even joins against your regular Postgres tables are possible.
-- Look for a correlation between book titles and review ratings in our JSON file
SELECT
width_bucket(length("product.title"), 1, 50, 5) title_length_bucket,
round(avg("review.rating"), 2) AS review_average,
count(*)
FROM
customer_reviews
WHERE
"product.group" = 'Book'
GROUP BY
title_length_bucket
ORDER BY
title_length_bucket;
This particular SQL query completes in about 6 seconds on a 3.1 GHz CPU core. This includes reading documents from the compressed JSON file, parsing fields from those documents, and handing them to Postgres for query execution.
We ran similar queries on the entire customer reviews dataset using a 3.1 GHz CPU core. Regular JSON files were 1.8 GB in size and fit into memory. Average document size was 300 bytes, and jsonfdw processed 122K documents/second. When compressed, the file size dropped by 80% down to 325 MB. In this case, jsonfdw's processing speed slightly slowed down to 108K documents/second.
We know that these numbers are just approximates; and we're sharing them here to give a rough idea about performance. Typically with analytic queries, a lot depends on the query itself and the underlying documents. For example, we found that as JSON documents get bigger in size, processing speeds decrease.
Of course, if query speeds are important to you, you could use CitusDB binaries, and "stage" data into a distributed foreign table. With this approach, you can also have the database automatically collect statistics about the underlying data, and apply query optimizations such as partition pruning.
For more details on query distribution, or any type of feedback you have on json_fdw, please get in touch with us at engage @ citusdata.com and tell us what you think!