Indexing all the things in Postgres

Postgres indexes make your application fast. One option is to analyze each of your relational database queries with pg_stat_statements to see where you should add indexes. An alternative (and much quicker) fix is to add an index to each and every database table—and every column—within your database. To make this easy for you, here’s a query you can run that generates the CREATE INDEX commands for every table and column in your Postgres database.

Disclaimer: Adding an index to every column of every table is not a best practice for production databases. And it’s certainly not something we recommend at Citus Data, where we take scaling out Postgres and database performance very seriously. But indexing all the things is a fun “what if” thought exercise that is well worth thinking about, to understand the value of indexes in Postgres. What if you indexed all the things?

With that disclaimer out of the way, onto my advice of how to index all the things as a way to understand your database’s performance.

What if you really did index all the things in your Postgres database?

Index all the things

Yes, we cheated and didn’t generate every multi-column combination; we figured single-column indexes on everything would be sufficient.

SELECT 'CREATE INDEX ' || table_name || '_' || column_name || ' ON ' || table_name || ' ("' || column_name || '");'
FROM information_schema.columns
WHERE table_schema = 'public'
  -- skip the views created by the pg_stat_statements and pg_buffercache extensions
  AND table_name NOT IN ('pg_stat_statements', 'pg_buffercache');

Based on our Citus Data multi-tenant tutorial you’d then get something like this as the result:

                                       ?column?
--------------------------------------------------------------------------------------
 CREATE INDEX companies_id ON companies ("id");
 CREATE INDEX companies_name ON companies ("name");
 CREATE INDEX companies_image_url ON companies ("image_url");
 CREATE INDEX companies_created_at ON companies ("created_at");
 CREATE INDEX companies_updated_at ON companies ("updated_at");
 CREATE INDEX campaigns_id ON campaigns ("id");
 CREATE INDEX campaigns_company_id ON campaigns ("company_id");
 CREATE INDEX campaigns_name ON campaigns ("name");
 CREATE INDEX campaigns_cost_model ON campaigns ("cost_model");
 CREATE INDEX campaigns_state ON campaigns ("state");
 CREATE INDEX campaigns_monthly_budget ON campaigns ("monthly_budget");
 CREATE INDEX campaigns_blacklisted_site_urls ON campaigns ("blacklisted_site_urls");
 CREATE INDEX campaigns_created_at ON campaigns ("created_at");
 CREATE INDEX campaigns_updated_at ON campaigns ("updated_at");
 CREATE INDEX ads_id ON ads ("id");
 CREATE INDEX ads_company_id ON ads ("company_id");
 CREATE INDEX ads_campaign_id ON ads ("campaign_id");
 CREATE INDEX ads_name ON ads ("name");
 CREATE INDEX ads_image_url ON ads ("image_url");
 CREATE INDEX ads_target_url ON ads ("target_url");
 CREATE INDEX ads_impressions_count ON ads ("impressions_count");
 CREATE INDEX ads_clicks_count ON ads ("clicks_count");
 CREATE INDEX ads_created_at ON ads ("created_at");
 CREATE INDEX ads_updated_at ON ads ("updated_at");
(24 rows)

If you need a schema just to test this on, you can use our multi-tenant tutorial as a starting point. In the case of the tutorial app and schema, everything will be snappy after adding the above indexes. For larger production systems, you might consider the concurrent version instead: it’s roughly 2-3 times slower to build, but it won’t block writes to the table while the index is being created (note that CREATE INDEX CONCURRENTLY can’t run inside a transaction block). For the concurrent index version:

SELECT 'CREATE INDEX CONCURRENTLY ' || table_name || '_' || column_name || ' ON ' || table_name || ' ("' || column_name || '");'
FROM information_schema.columns
WHERE table_schema = 'public'
  -- skip the views created by the pg_stat_statements and pg_buffercache extensions
  AND table_name NOT IN ('pg_stat_statements', 'pg_buffercache');
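Rather than copy-pasting the generated statements back into psql, you can have psql execute them for you with \gexec (available in psql 9.6 and later), which runs each cell of the result as its own statement. A sketch using the same query—note there’s no trailing semicolon before \gexec:

```sql
-- Each generated CREATE INDEX CONCURRENTLY statement runs as a
-- separate, autocommitted command, which is exactly what CONCURRENTLY needs.
SELECT 'CREATE INDEX CONCURRENTLY ' || table_name || '_' || column_name
       || ' ON ' || table_name || ' ("' || column_name || '");'
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name NOT IN ('pg_stat_statements', 'pg_buffercache')
\gexec
```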

Indexing all the things in Postgres is a horrible idea, but…

Stepping back: for those unfamiliar, every index you add to your Postgres database slows down write throughput on the table it lives on, because every INSERT and UPDATE also has to update each of that table’s indexes.
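If you want to see that write overhead first-hand, here’s a minimal sketch you can run in psql. The table name, row count, and md5 payload are arbitrary choices for illustration:

```sql
-- Time a bulk load into an unindexed table, then again with indexes in place.
\timing on
CREATE TABLE bench (id bigint, payload text);
INSERT INTO bench SELECT i, md5(i::text) FROM generate_series(1, 1000000) i;

-- Add a couple of indexes and reload: every row now updates both indexes too.
CREATE INDEX bench_id ON bench (id);
CREATE INDEX bench_payload ON bench (payload);
TRUNCATE bench;
INSERT INTO bench SELECT i, md5(i::text) FROM generate_series(1, 1000000) i;
```

On most machines the second INSERT is noticeably slower than the first, and each additional index adds to that cost.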

A cool thing about Postgres, though, is that it tracks all sorts of interesting data on when indexes are used and when they’re not. Similar to how pg_stat_statements gives us insights into which queries run and how they perform, we can discover unused indexes, and how large they are, by looking at the catalog tables.

With this query, we can see our rarely used, non-unique indexes, with the largest and least-scanned ones sorted first:

SELECT
  schemaname || '.' || relname AS table,
  indexrelname AS index,
  pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size,
  idx_scan AS index_scans
FROM pg_stat_user_indexes ui
JOIN pg_index i ON ui.indexrelid = i.indexrelid
WHERE NOT indisunique                          -- unique indexes enforce constraints, so keep them
  AND idx_scan < 50                            -- rarely scanned
  AND pg_relation_size(relid) > 5 * 8192       -- ignore tables smaller than 5 pages
ORDER BY
  pg_relation_size(i.indexrelid) / nullif(idx_scan, 0) DESC NULLS FIRST,
  pg_relation_size(i.indexrelid) DESC;

Coupled with our first query, this lets us easily index all the columns, wait a few hours or days to see which indexes actually get leveraged, and then come back and clean up all the unused ones. The result is a perfectly performing database.
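To round out that cleanup step, here’s a sketch that generates the matching DROP INDEX statements. The zero-scan threshold is an assumption (adjust it to taste), and since idx_scan counts are cumulative only since the last statistics reset, review the output before running any of it:

```sql
-- Generate DROP INDEX statements for non-unique indexes that have
-- never been scanned. Stats are cumulative since the last pg_stat_reset(),
-- so give the indexes enough time to prove themselves first.
SELECT 'DROP INDEX CONCURRENTLY ' || schemaname || '.' || indexrelname || ';'
FROM pg_stat_user_indexes ui
JOIN pg_index i ON ui.indexrelid = i.indexrelid
WHERE NOT indisunique
  AND idx_scan = 0;
```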

If you prefer an approach to optimizing your database performance that is more reasonable than indexing all the things, consider this Postgres post as a starting point, or watch this talk by Andrew Kane on how Postgres could index itself.

And if you’re running into scaling issues due to performance, feel free to contact my team at Citus; we’d be happy to chat.