When to use unstructured datatypes in Postgres–Hstore vs. JSON vs. JSONB

Written by Craig Kerstiens
July 14, 2016

Since Postgres started supporting NoSQL (via hstore, json, and jsonb), the question of when to use Postgres in relational mode vs NoSQL mode has come up a lot. Do you entirely abandon traditional table structures, and go with documents all the way? Or do you intermingle both? The answer, unsurprisingly, is: it depends. Each newer model, including hstore, JSON, and JSONB, has its ideal use cases. Here we'll dig deeper into each and see when you should consider using them.

HStore

If you exclude XML, this was the first truly unstructured datatype to arrive in Postgres. Hstore arrived way back in Postgres 8.3, before upsert, before streaming replication, and before window functions. Hstore is essentially a key/value store directly in Postgres. With hstore you're a little more limited in terms of the datatypes you have: you essentially just get strings. You also don't get any nesting; in short it's a flat key/value datatype.

The upside of hstore is you don't have to define any of your keys ahead of time. You can simply insert the record and it'll save everything. Let's say you've got an example table:

CREATE TABLE products (
  id serial PRIMARY KEY,
  name varchar,
  attributes hstore
);

From here you can insert whatever you want into the attributes column, and then query based on those various keys or values.

INSERT INTO products (name, attributes) VALUES (
 'Geek Love: A Novel',
 'author    => "Katherine Dunn",
  pages     => 368,
  category  => fiction'
);

SELECT name, attributes->'author' as author
FROM products
WHERE attributes->'category' = 'fiction';

The obvious benefit here is flexibility, but where it really shines is being able to leverage various index types. In particular, a GIN or GiST index will index every key and value within the hstore. This way when you filter on something it'll use the index if it makes sense to the planner within Postgres.
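As a minimal sketch of that indexing, assuming the products table from above (the index name idx_products_attributes is made up for illustration): a GIN index on hstore serves the containment and key-existence operators (@>, ?, ?&, ?|), so the filter here is written with @> rather than comparing the result of ->.

```sql
-- hstore ships as an extension; enable it once per database.
CREATE EXTENSION IF NOT EXISTS hstore;

-- Index every key and value in the attributes column.
-- (idx_products_attributes is an illustrative name.)
CREATE INDEX idx_products_attributes ON products USING gin (attributes);

-- Containment (@>) is one of the operators a GIN index on hstore can serve.
SELECT name, attributes->'author' AS author
FROM products
WHERE attributes @> 'category => fiction';
```

Whether the planner actually picks the index still depends on table size and statistics; on a tiny table a sequential scan is usually cheaper.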

As hstore isn't a full document equivalent, it's a stretch to consider using it as such. If you have relational data plus some attributes that won't exist for every row, it can be a great fit. The most basic case is a product catalog: books have attributes such as whether they're fiction, while clothes have attributes such as size and color. Having a column for every possible attribute of every product can very much be overkill.
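To make the catalog case concrete, here is a small sketch that reuses the products table from above. The product name and attribute values are invented for illustration.

```sql
-- A clothing product carries different keys than a book; no schema change needed.
INSERT INTO products (name, attributes) VALUES (
  'Plain Crew T-Shirt',
  'size => M, color => navy, category => clothing'
);

-- ? tests whether a key exists, so this finds every product that has a size.
SELECT name, attributes->'size' AS size, attributes->'color' AS color
FROM products
WHERE attributes ? 'size';
```

Books simply never set size or color, and the query skips them without any NULL-filled columns in the table definition.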

JSON

When Postgres 9.2 arrived it was well received as the JSON release. Finally, Postgres could compete against Mongo. (Although the JSON functionality in Postgres 9.2 was probably a little oversold.)

The JSON datatype in Postgres is, under the covers, still largely just a text field. What you do get with it is validation as the data comes in: Postgres enforces that it's actually JSON. One small potential benefit over JSONB (which we'll get to next) is that it preserves the whitespace, including indentation, of the data coming in. So if you are extremely particular about the formatting of your JSON, or need it kept in a particular structure, JSON can be useful.
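A quick way to see the difference is to cast the same literal to both types. This is only a sketch; the literal is a made-up example.

```sql
-- json keeps the input text exactly as written, including whitespace,
-- key order, and duplicate keys.
SELECT '{"b": 1,    "a": 2, "a": 3}'::json;

-- jsonb normalizes the document: extra whitespace is dropped, keys may be
-- reordered, and only the last value of a duplicate key is kept.
SELECT '{"b": 1,    "a": 2, "a": 3}'::jsonb;

-- Both types validate on input; this statement fails with an
-- "invalid input syntax" error because the closing brace is missing.
SELECT '{"b": 1'::json;
```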

Furthermore, over time Postgres has picked up a number of niceties in the form of functions that can help. So, the question is: should you use JSON? At the end of the day, Postgres' JSON type simply provides JSON validation on a text field. If you're storing some form of log data you rarely need to query, JSON can work fine. Because it skips the conversion to a binary format, it can also offer higher write throughput. For anything more complex, I'd recommend using JSONB, which is covered below.
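For the log-style case, a sketch might look like the following. The request_logs table, its columns, and the payload fields are all hypothetical names chosen for illustration.

```sql
-- request_logs and its columns are illustrative names, not from a real schema.
CREATE TABLE request_logs (
  id bigserial PRIMARY KEY,
  received_at timestamptz NOT NULL DEFAULT now(),
  payload json
);

-- Writes only validate the document and store the text as given.
INSERT INTO request_logs (payload) VALUES
  ('{"method": "GET", "path": "/products/42", "status": 500}');

-- Occasional reads still work: ->> extracts a field as text,
-- but each row's JSON has to be re-parsed to answer the query.
SELECT received_at, payload->>'path' AS path
FROM request_logs
WHERE payload->>'status' = '500';
```

That re-parsing cost is why this shape suits data you write often and read rarely.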

JSONB

Finally, in Postgres 9.4 we got real and proper JSON in the form of JSONB. The B stands for better. JSONB is a binary representation of JSON: the document is parsed once on the way in and stored in a decomposed form, which makes it more efficient to work with than plain text. It also has similar plumbing to hstore underneath. In fact, once upon a time there was almost an hstore2 and a separate JSON type, but the two converged into the JSONB we have today.

JSONB is largely what you'd expect from a JSON datatype. It allows nested structures and basic datatypes, and has a number of built-in functions for working with it. The best part, as with hstore, is the indexing. Creating a GIN index on a JSONB column will create an index on every key and value within that JSON document. That, along with the ability to nest within the document, means JSONB is superior to hstore in most cases.
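Here is a rough sketch of nesting and indexing together. The catalog_items table, the index name, and the nested values (including the dimensions) are invented for illustration.

```sql
-- catalog_items is an illustrative table name.
CREATE TABLE catalog_items (
  id serial PRIMARY KEY,
  name varchar,
  attributes jsonb
);

-- Unlike hstore, values can be numbers, arrays, and nested objects.
INSERT INTO catalog_items (name, attributes) VALUES (
  'Geek Love: A Novel',
  '{"author": "Katherine Dunn", "pages": 368,
    "categories": ["fiction"], "dimensions": {"height_cm": 20}}'
);

CREATE INDEX idx_catalog_items_attributes ON catalog_items USING gin (attributes);

-- Containment (@>) matches inside arrays and nested objects and can use the GIN index;
-- -> walks into the nested object and ->> returns the final value as text.
SELECT name, attributes->'dimensions'->>'height_cm' AS height_cm
FROM catalog_items
WHERE attributes @> '{"categories": ["fiction"]}';
```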

That still leaves the question of when to use only JSONB. If you want a document database, instead of one of the other options out there you could go directly to Postgres. With a package like MassiveJS this can become quite seamless as well. But even then, there are some clear examples where going more document-heavy makes the most sense. Some of the most common examples include:

  • Event tracking data, where you may want to include the payload in the event which might vary
  • Gaming data, especially where you have single-player games and a schema that changes based on the state of the user
  • Tools that integrate multiple data sources. An example here may be a tool that integrates customer databases with Salesforce, Zendesk, and something else. The mix of schemas makes doing this in a multitenant fashion more painful than it has to be.

Let's take a quick look at how the third example might work with JSONB. First let's create a table, and insert some example data:

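-- uuid_generate_v4() is provided by the uuid-ossp extension
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE TABLE integrations (id UUID, data JSONB);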
INSERT INTO integrations VALUES (
  uuid_generate_v4(),
  '{
     "service": "salesforce",
     "id": "AC347D212341XR",
     "email": "craig@citusdata.com",
     "occurred_at": "8/14/16 11:00:00",
     "added": {
       "lead_score": 50
     },
     "updated": {
       "updated_at": "8/14/16 11:00:00"
     }
   }');
INSERT INTO integrations VALUES (
  uuid_generate_v4(),
  '{
     "service": "zendesk",
     "email": "craig@citusdata.com",
     "occurred_at": "8/14/16 10:50:00",
     "ticket_opened": {
       "ticket_id": 1234,
       "ticket_priority": "high"
     }
   }');

In the above case I could easily search for all events that have happened to craig@citusdata.com, then take some action. This could be some form of behavioral analytics, such as finding users that have done foo then bar, or a simple report, such as finding all high-scoring leads that have opened a ticket. Adding a GIN index means all of the data within my JSONB field is automatically indexed:

CREATE INDEX idx_integrations_data ON integrations USING gin(data);
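The two searches described above could be sketched roughly as follows, assuming the integrations table and example rows from earlier. The lead score threshold of 40 is an arbitrary value picked for illustration.

```sql
-- All events for one email address; @> is an operator the GIN index on data can serve.
SELECT data->>'service' AS service, data->>'occurred_at' AS occurred_at
FROM integrations
WHERE data @> '{"email": "craig@citusdata.com"}';

-- High-scoring leads that have also opened a ticket.
-- sf is the Salesforce event, zd is the matching ticket event.
SELECT DISTINCT sf.data->>'email' AS email
FROM integrations AS sf
JOIN integrations AS zd
  ON zd.data->>'email' = sf.data->>'email'
WHERE sf.data @> '{"service": "salesforce"}'
  AND (sf.data->'added'->>'lead_score')::int >= 40  -- arbitrary threshold
  AND zd.data ? 'ticket_opened';
```

The containment and key-existence filters can use the GIN index; the numeric comparison on lead_score and the join on email are evaluated on the rows those filters return.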

In conclusion

In most cases JSONB is likely what you want when looking for a NoSQL, schema-less datatype. Hstore and JSON can have their place as well, but it's less common. More broadly, JSONB isn't a fit for every data model. Where you can normalize there are benefits, but if you have a schema with a large number of optional columns (such as with event data), or the schema differs based on tenant id, then JSONB can be a great fit. In general you want:

  • JSONB - In most cases
  • JSON - If you're just processing logs, don't often need to query, and use it as more of an audit trail
  • hstore - Can work fine for text-based key-value lookups, but in general JSONB can still work great here
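As a sketch of the mixed approach from the conclusion, where normalized columns hold what every row has and JSONB holds what varies, a table might look like this. The events table, its columns, and the payload fields are hypothetical names.

```sql
-- events and its columns are illustrative names.
CREATE TABLE events (
  id bigserial PRIMARY KEY,
  tenant_id integer NOT NULL,          -- always present, so a real column
  event_type text NOT NULL,
  occurred_at timestamptz NOT NULL,
  payload jsonb NOT NULL DEFAULT '{}'  -- optional or tenant-specific fields
);

CREATE INDEX idx_events_payload ON events USING gin (payload);

-- Filter on the relational columns and on the document in one query.
SELECT id, occurred_at, payload->>'ticket_priority' AS priority
FROM events
WHERE tenant_id = 1
  AND event_type = 'ticket_opened'
  AND payload @> '{"ticket_priority": "high"}';
```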

Are you using any of Postgres' schema-less datatypes? What type of workloads are you using them for? We'd love to hear about them @citusdata.

Written by Craig Kerstiens

Former Head of Cloud at Citus Data. Ran product at Heroku Postgres. Countless conference talks on Postgres & Citus. Loves bbq and football.