The Postgres 10 feature you didn't know about: CREATE STATISTICS

Written by Samay Sharma
March 6, 2018

If you’ve done some performance tuning with Postgres, you might have used EXPLAIN. EXPLAIN shows you the execution plan that the PostgreSQL planner generates for the supplied statement. It shows how the table(s) referenced by the statement will be scanned (using a sequential scan, index scan, etc.), and what join algorithms will be used if multiple tables are involved. But how does Postgres come up with these plans?

One very significant input to deciding which plan to use is the statistics the planner collects. These statistics allow the planner to estimate how many rows will be returned after executing a certain part of the plan, which then influences the kind of scan or join algorithm that will be used. They are collected and updated mainly by running ANALYZE or VACUUM (and by a few DDL commands such as CREATE INDEX).

These statistics are stored in pg_class and in pg_statistic, and used by the planner. pg_class stores the total number of entries in each table and index, as well as the number of disk blocks occupied by them. pg_statistic stores statistics about each column, like what percentage of values are null for the column, what the most common values are, histogram bounds, etc. You can see below the kind of statistics Postgres collected for col1 in our table. The query output shows that the planner (correctly) estimates that there are 1000 distinct values for the column col1, and it also makes other estimates on most common values, frequencies, etc.

Note that we’ve queried pg_stats (a view holding a more readable version of the column statistics).

CREATE TABLE tbl (
    col1 int,
    col2 int
);

INSERT INTO tbl SELECT i/10000, i/100000
FROM generate_series (1,10000000) s(i);

ANALYZE tbl;

select * from pg_stats where tablename = 'tbl' and attname = 'col1';
-[ RECORD 1 ]----------+----------------------------------------------
schemaname             | public
tablename              | tbl
attname                | col1
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | 1000
most_common_vals       | {318,564,596,...}
most_common_freqs      | {0.00173333,0.0017,0.00166667,0.00156667,...}
histogram_bounds       | {0,8,20,30,39,...}
correlation            | 1
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |
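As a quick aside, you can peek at those table-level numbers yourself; the query below is just for illustration (reltuples holds the estimated row count and relpages the number of disk blocks):

-- Table-level statistics: estimated row count and number of disk blocks
SELECT relname, reltuples, relpages
FROM pg_class
WHERE relname = 'tbl';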

When single column statistics are not enough

These single-column statistics help the planner estimate the selectivity of your conditions (this is what the planner uses to estimate how many rows will be selected by, say, your index scan). When multiple conditions are supplied in the query, the planner assumes that the columns (or the WHERE clause conditions) are independent of each other. This doesn’t hold true when columns are correlated or dependent on each other, and in those cases it leads the planner to under- or over-estimate the number of rows which will be returned by these conditions.

Let’s look at a few examples below. To keep the plans simple to read, we’ve switched off per-query parallelism by setting max_parallel_workers_per_gather to 0.
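That is, before running the examples below (a single session-level setting):

SET max_parallel_workers_per_gather TO 0;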

EXPLAIN ANALYZE SELECT * FROM tbl where col1 = 1;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Seq Scan on tbl  (cost=0.00..169247.80 rows=9584 width=8) (actual time=0.641..622.851 rows=10000 loops=1)
   Filter: (col1 = 1)
   Rows Removed by Filter: 9990000
 Planning time: 0.051 ms
 Execution time: 623.185 ms
(5 rows)

As you can see here, the planner estimates that the number of rows which have the value 1 for col1 is 9584, while the actual number of rows the query returns is 10000. So, pretty accurate.

But what happens when you include filters on both col1 and col2?

EXPLAIN ANALYZE SELECT * FROM tbl where col1 = 1 and col2 = 0;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Seq Scan on tbl  (cost=0.00..194248.69 rows=100 width=8) (actual time=0.640..630.130 rows=10000 loops=1)
   Filter: ((col1 = 1) AND (col2 = 0))
   Rows Removed by Filter: 9990000
 Planning time: 0.072 ms
 Execution time: 630.467 ms
(5 rows)

The planner estimate is already off by 100x! Let’s try to understand why that happened.

The selectivity for the first column is around 0.001 (1/1000) and the selectivity for the second column is 0.01 (1/100). To calculate the number of rows which will satisfy these two “independent” conditions, the planner multiplies their selectivities. So, we get:

Selectivity = 0.001 * 0.01 = 0.00001.

When that is multiplied by the number of rows we have in the table, i.e. 10000000, we get 100. That’s where the planner’s estimate of 100 is coming from. But these columns are not independent, so how do we tell the planner that?
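As a quick aside (purely a sanity check, not something the planner needs you to run), the 1/1000 and 1/100 above line up with the per-column statistics the planner already has in pg_stats, such as n_distinct and the most-common-value frequencies:

-- Per-column distinct counts behind the planner's independent selectivity estimates
SELECT attname, n_distinct
FROM pg_stats
WHERE tablename = 'tbl' AND attname IN ('col1', 'col2');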

CREATE STATISTICS in PostgreSQL

Before Postgres 10, there wasn’t an easy way to tell the planner to collect statistics which capture this relationship between columns. But, with Postgres 10, there’s a new feature which is built to solve exactly this problem. CREATE STATISTICS can be used to create extended statistics objects which tell the server to collect extra statistics about these interesting related columns.
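In Postgres 10, the statistics kinds you can request are ndistinct and dependencies, and the command has the general shape below; stats_name, the column names, and table_name are just placeholders:

-- General form of the new command (Postgres 10)
CREATE STATISTICS stats_name (dependencies, ndistinct)
    ON column_1, column_2 FROM table_name;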

Functional dependency statistics

Getting back to our previous estimation problem, the issue was that the value of col2 is actually nothing but col1 / 10. In database terminology, we would say that col2 is functionally dependent on col1. What that means is that the value of col1 is sufficient to determine the value of col2, and that there are no two rows having the same value of col1 but different values of col2. Therefore, the second filter on col2 actually doesn’t remove any rows! But the planner doesn’t capture enough statistics to know that.
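If you want to convince yourself of that on this dataset, a quick sanity check (a full scan of the table, so purely illustrative) is to look for any col1 value that maps to more than one col2 value; here it should return no rows:

-- If col2 is fully determined by col1, no group should survive the HAVING clause
SELECT col1, count(DISTINCT col2) AS distinct_col2_values
FROM tbl
GROUP BY col1
HAVING count(DISTINCT col2) > 1;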

Let’s create a statistics object to capture functional dependency statistics about these columns and run ANALYZE.

CREATE STATISTICS s1 (dependencies) on col1, col2 from tbl;
ANALYZE tbl;

Let’s see what the planner comes up with now.

EXPLAIN ANALYZE SELECT * FROM tbl where col1 = 1 and col2 = 0;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Seq Scan on tbl  (cost=0.00..194247.76 rows=9584 width=8) (actual time=0.638..629.741 rows=10000 loops=1)
   Filter: ((col1 = 1) AND (col2 = 0))
   Rows Removed by Filter: 9990000
 Planning time: 0.115 ms
 Execution time: 630.076 ms
(5 rows)

Much better! Let’s look at what helped the planner make that determination.

SELECT stxname, stxkeys, stxdependencies
  FROM pg_statistic_ext
  WHERE stxname = 's1';
 stxname | stxkeys |   stxdependencies
---------+---------+----------------------
 s1      | 1 2     | {"1 => 2": 1.000000}
(1 row)

Looking at this, we can see that Postgres realizes that col1 fully determines col2 and therefore records a dependency coefficient of 1 to capture that information. Now, all queries with filters on both these columns will have much better estimates.

ndistinct statistics

Functional dependency is one kind of relationship you can capture between columns. Another kind of statistic you can capture is the number of distinct values for a set of columns. We noted earlier that the planner keeps statistics on the number of distinct values for each column, but again those statistics are frequently wrong when more than one column is combined.

When does having bad distinct statistics hurt? Let’s look at an example.

EXPLAIN ANALYZE SELECT col1,col2,count(*) from tbl group by col1, col2;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=1990523.20..2091523.04 rows=100000 width=16) (actual time=2697.246..4470.789 rows=1001 loops=1)
   Group Key: col1, col2
   ->  Sort  (cost=1990523.20..2015523.16 rows=9999984 width=8) (actual time=2695.498..3440.880 rows=10000000 loops=1)
         Sort Key: col1, col2
         Sort Method: external sort  Disk: 176128kB
         ->  Seq Scan on tbl  (cost=0.00..144247.84 rows=9999984 width=8) (actual time=0.008..665.689 rows=10000000 loops=1)
 Planning time: 0.072 ms
 Execution time: 4494.583 ms

When aggregating rows, Postgres chooses to do either a hash aggregate or a group aggregate. If it estimates that the hash table will fit in memory, it chooses a hash aggregate; otherwise it chooses to sort all the rows and then group them according to col1, col2.

Now, the planner estimates that the number of groups (which is equal to the number of distinct values for col1, col2) will be 100000, and it sees that it doesn’t have enough work_mem to store that hash table in memory. So, it uses a disk-based sort to run the query. However, as you can see in the actual section of the plan, the number of actual rows is only 1001. Maybe we had enough memory to fit them in memory and do a hash aggregation after all.
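To double-check the planner against reality, you can count the distinct (col1, col2) pairs directly and look at your current memory budget; both statements below are just sanity checks, and the first one does a full scan of the table:

-- Actual number of (col1, col2) groups: only 1001, not the estimated 100000
SELECT count(*) FROM (SELECT DISTINCT col1, col2 FROM tbl) AS distinct_pairs;

-- The memory budget the planner compares the estimated hash table size against
SHOW work_mem;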

Let’s ask the planner to capture n_distinct statistics, re-run the query and find out.

CREATE STATISTICS s2 (ndistinct) on col1, col2 from tbl;
ANALYZE tbl;

EXPLAIN ANALYZE SELECT col1,col2,count(*) from tbl group by col1, col2;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=219247.63..219257.63 rows=1000 width=16) (actual time=2431.767..2431.928 rows=1001 loops=1)
   Group Key: col1, col2
   ->  Seq Scan on tbl  (cost=0.00..144247.79 rows=9999979 width=8) (actual time=0.008..643.488 rows=10000000 loops=1)
 Planning time: 0.129 ms
 Execution time: 2432.010 ms
(5 rows)

You can see that the estimates are now much more accurate (i.e. 1000), and the query is now around 2x faster. We can see what the planner learned by running the query below.

SELECT stxkeys AS k, stxndistinct AS nd
  FROM pg_statistic_ext
  WHERE stxname = 's2';
  k  |       nd
-----+----------------
 1 2 | {"1, 2": 1000}

Real-world implications

In actual production schemas, you invariably have certain columns that have dependencies or relationships with each other which the database doesn't know about. Some examples we’ve seen with Citus open source and with Citus customers in the cloud are:

  • Having columns for month, quarter and year because you want to show statistics grouped by each of them in reports.
  • Relationships within geographical hierarchies, e.g. having country, state and city columns and filtering / grouping by them (a sketch of what that could look like follows this list).
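For the geographical-hierarchy case, for example, the statistics object could look something like the sketch below. The table and column names (stores, country, state, city) are made up for illustration; the shape of the command is the same as in the examples above:

-- Hypothetical example: capture extended statistics across a geo hierarchy
CREATE STATISTICS stores_geo_stats (dependencies, ndistinct)
    ON country, state, city FROM stores;
ANALYZE stores;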

The example here has only 10M rows in the dataset, and we already see that using CREATE STATISTICS significantly improves plans in cases where there are correlated columns, along with the resulting performance gains. In Citus use cases, we have users storing billions of rows of data, and the implications of bad plans can be drastic. In our example, when the planner chose a bad plan we had to do a disk-based sort of 10M rows; imagine how bad it would have been with billions of rows.

Postgres keeps getting better and better

When we set out to build Citus, we explicitly chose Postgres as the foundation to build on. By extending Postgres we chose a solid foundation that continues to get better with each release. And because Citus is a pure extension and not a fork, you get to take advantage of all the great new features that come out in each release when using Citus.

Samay Sharma

Former open source Postgres engineering team manager at Microsoft. Citus and Postgres software developer and technical writer. CS degree from IIIT Hyderabad. Bollywood movie and dance lover.