Materialized views were a long-awaited feature in Postgres for a number of years. They finally arrived in Postgres 9.3, though at the time they were limited: refreshing a materialized view held a lock on it for the duration of the refresh, blocking reads. If your workload was strictly business-hours based that could work, but if you were powering something for end users it was a deal breaker. Postgres 9.4 added the ability to refresh materialized views concurrently. With this we now have fully baked materialized view support, but even still, we've seen they may not always be the right approach.
What is a view?
For those of you who aren't database experts, we're going to back up a little bit. To understand what a materialized view is, we'll first look at a standard view. A view is a defined query that you can query against as if it were a table. Views are especially helpful when you have complex data models that often combine into some standard report/building block. We'll look at an example in just a moment as we get to materialized views.
Views are great for simplifying copy/paste of complex SQL. The downside is that each time a view is queried its results are recomputed. For large datasets this can mean scanning a lot of data, evicting useful pages from your cache, and in general just being slow. Enter materialized views.
Materializing your views
Let’s start with an example schema that could contain a lot of raw data. In this case a very basic web analytics tool that records pageview, the time it occurred, and the session id of the user.
CREATE TABLE pageviews (
  id bigserial,
  page text,
  occurred_at timestamptz,
  session_id bigint
);
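Before materializing anything, it's worth seeing what a plain (non-materialized) view over this table looks like; the view name here is just for illustration:

```sql
-- A plain view: the aggregation below is re-run every time the view is queried.
CREATE VIEW page_views_per_day AS
SELECT date_trunc('day', occurred_at) as day,
       page,
       count(*) as views
FROM pageviews
GROUP BY date_trunc('day', occurred_at), page;
```

Every query against page_views_per_day re-reads and re-aggregates the raw pageviews rows, which is exactly what makes plain views slow on large datasets.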
There are a number of different views that could be very common based on this raw data. And if we have a real-time dashboard we're powering, it can quickly become infeasible to query the raw data directly, as the query would take too long. Instead we can do some rollups with materialized views:
CREATE MATERIALIZED VIEW rollups AS
SELECT date_trunc('day', occurred_at) as day,
       page,
       count(*) as views
FROM pageviews
GROUP BY date_trunc('day', occurred_at), page;
This would give us 1 record each day for each page that had at least 1 page view.
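Reading from the materialized view is then an ordinary scan of the precomputed rows rather than an aggregation over raw data; for example (the page value here is just for illustration):

```sql
-- Fetch the precomputed daily counts for one page; no aggregation
-- over the raw pageviews table happens at query time.
SELECT day, views
FROM rollups
WHERE page = '/home'
ORDER BY day;
```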
For nightly batch jobs that process the previous day's data this is fine. But for something customer facing you may not want to wait until the end of the day before you provide analytics on how your pageviews are progressing. You can of course just refresh this at regular intervals:
refresh materialized view rollups;
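As of Postgres 9.4 you can refresh without locking out readers by adding CONCURRENTLY, though this requires a unique index on the materialized view first (the index name here is arbitrary):

```sql
-- A unique index covering all rows is required before CONCURRENTLY is allowed.
CREATE UNIQUE INDEX rollups_day_page ON rollups (day, page);

-- Rebuilds the view without blocking concurrent SELECTs against it.
REFRESH MATERIALIZED VIEW CONCURRENTLY rollups;
```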
The downside to refreshing this way is that each time you refresh you're recomputing the totals for the entire day, repeating processing that has already been done.
Incremental roll-ups for scalability
An alternative approach, which allows us to incrementally roll up data without re-processing all the underlying rows, is to use an upsert. Upsert is essentially "create or update". To do this, first we'll create a table instead of our materialized view, and then we'll put a unique constraint on it:
CREATE TABLE rollups (
  day timestamptz,
  page text,
  views bigint,
  constraint unq_page_per_day unique (day, page)
);
Now to roll-up we would do something similar to:
INSERT INTO rollups
SELECT date_trunc('day', occurred_at) as day,
       page,
       count(*) as views
FROM pageviews
GROUP BY date_trunc('day', occurred_at), page;
This is essentially the same query as our materialized view. But because of our unique constraint, the insert would error when we encounter a record we've already inserted. To make this work we're going to adapt our query in two ways: first, we'll only process new records; second, we'll use the upsert syntax.
For processing new records we're going to keep a record of where we last left off and only process the new ones. There is a handy set of functions/tables you can use to do this, which we outline in this post. With the function and table in place to track where we last left off, we'll now update our query to only aggregate data since the last processed record, then combine that with an upsert. The upsert will try to insert any new records for that day/page, and if it already sees those values it will increment them:
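As a rough sketch of that bookkeeping (the linked post has the full version; the table and names here are hypothetical), it can be as small as one row holding the last pageviews.id that has been rolled up:

```sql
-- Hypothetical single-row watermark table: the last pageviews.id already rolled up.
CREATE TABLE rollup_progress (
  name text primary key,
  last_id bigint not null default 0
);

-- Read the watermark before aggregating...
SELECT last_id FROM rollup_progress WHERE name = 'pageviews';

-- ...and advance it after a successful rollup.
UPDATE rollup_progress
SET last_id = (SELECT coalesce(max(id), 0) FROM pageviews)
WHERE name = 'pageviews';
```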
INSERT INTO rollups
SELECT date_trunc('day', occurred_at) as day,
       page,
       count(*) as views
FROM pageviews
WHERE id > e  -- e is the last processed id from the tracking function
GROUP BY date_trunc('day', occurred_at), page
ON CONFLICT (day, page) DO UPDATE
SET views = rollups.views + EXCLUDED.views;
Materialized views vs. rollup tables: which is right?
Materialized views are a very simple and straightforward approach. Their ease of use makes them a good choice for something quick and easy. But for larger datasets and databases with more active workloads, processing only the net-new data since your last rollup can be a more efficient use of resources. Which is right for you depends on the time constraints of getting something in place versus your system resources. Though as you can see, the rollup approach is only slightly more effort and scales much further.