Don’t let collation versions corrupt your PostgreSQL indexes

This post by Postgres committer Thomas Munro about collation versioning was originally published on the Azure Database for PostgreSQL Blog on Microsoft TechCommunity.

[UPDATE in Sep 2021]: This blog post was originally written during the PostgreSQL 14 development cycle. The feature discussed is now a candidate for PostgreSQL 15 and the text has been updated to reflect this.

As part of my work on the open source PostgreSQL team at Microsoft, I've been developing a new feature for PostgreSQL to track dependencies on collation versions, with help from co-author Julien Rouhaud and many others who have contributed ideas. It's taken a long time to build a consensus on how to tackle this thorny problem (work I began at EnterpriseDB and continued at Microsoft), and you can read about some of the details and considerations in the commit message below and the referenced discussion thread. We're not quite done with that yet. It was originally planned for PostgreSQL 14, but some unhandled complications arose so this project is back in the workshop.

commit 257836a75585934cc05ed7a80bccf8190d41e056
Author: Thomas Munro <tmunro@postgresql.org>
Date:   Mon Nov 2 19:50:45 2020 +1300

    Track collation versions for indexes.

    Record the current version of dependent collations in pg_depend when
    creating or rebuilding an index.  When accessing the index later, warn
    that the index may be corrupted if the current version doesn't match.

    Thanks to Douglas Doole, Peter Eisentraut, Christoph Berg, Laurenz Albe,
    Michael Paquier, Robert Haas, Tom Lane and others for very helpful
    discussion.

    Author: Thomas Munro <thomas.munro@gmail.com>
    Author: Julien Rouhaud <rjuju123@gmail.com>
    Reviewed-by: Peter Eisentraut <peter.eisentraut@2ndquadrant.com> (earlier versions)
    Discussion: https://postgr.es/m/CAEepm%3D0uEQCpfq_%2BLYFBdArCe4Ot98t1aR4eYiYTe%3DyavQygiQ%40mail.gmail.com

In this article I'll talk about the problem we need to solve—that PostgreSQL indexes can get corrupted by changes in collations that occur naturally over time—and how the new feature will make things better in a future version of PostgreSQL. Plus, you’ll get a bit of background on collations, too.

What is a collation?

The term derives from a form of the Latin verb conferre meaning to confer, bring together, or compare. Book binders use it to mean putting pages into the correct order. Outside special technical fields, it may be better known as an old fashioned and formal name for a cold meal (cf. colazione, Italian for breakfast). That meaning comes to us from the same ultimate source, having developed a different meaning in 4th century monasteries through association with brown bag lunches, but I digress…

Many software systems including POSIX and Windows have locales that encapsulate language and cultural conventions. A collation is the component of a locale that controls the way text is sorted. Even though many languages share common writing systems, they don't necessarily agree on how to sort words. Even within the same language, there can be differences in how words should be sorted in different countries or in specialised contexts.

Similarly, in the context of SQL, a COLLATION is a database object that identifies a set of rules for sorting character strings.

For example, in a Danish dictionary, "å" comes after "z", and "aa" is considered to be an alternative spelling of "å". PostgreSQL models the different possible sorting rules with different collations, and we can see this difference on most systems like so:

postgres=# create table words (word text);
CREATE TABLE
postgres=# insert into words values ('Aarhus'), ('Banana'), ('Cat');
INSERT 0 3
postgres=# select * from words order by word collate "en_NZ";
  word
--------
 Aarhus
 Banana
 Cat
(3 rows)

postgres=# select * from words order by word collate "da_DK";
  word
--------
 Banana
 Cat
 Aarhus
(3 rows)

This might affect where a user expects to find a value in drop-down list or contents page.

Collations are part of the static type system of SQL. If you're working with text, you can't escape them! You can mostly ignore them and just let your whole system use the collation named "default", though. That's how a lot of you probably consume collations, especially if you only deal with one language. That means you get the default for the database you're connected to:

postgres=# \l postgres
                              List of databases
   Name   | Owner  | Encoding |   Collate   |    Ctype    | Access privileges
----------+--------+----------+-------------+-------------+-------------------
 postgres | tmunro | UTF8     | en_NZ.UTF-8 | en_NZ.UTF-8 |
(1 row)

Often the default collation is set by initdb when the PostgreSQL cluster is first created. The collation and other locale components can be set using defaults from the operating system environment, and then inherited by databases created later. However, they can also be explicitly specified, and changed in later CREATE DATABASE commands with certain restrictions. You can also opt for one of the special collations "C", "POSIX", or "ucs_basic" which sort in binary or code point order rather than following cultural conventions, avoiding the problems described below—fair warning though, the results might surprise a few humans if they are reflected in user interfaces.

So what's the catch?

To quote from Unicode Technical Standard #10:

Over time, collation order will vary: there may be fixes needed as more information becomes available about languages; there may be new government or industry standards for the language that require changes; and finally, new characters added to the Unicode Standard will interleave with the previously-defined ones. This means that collations must be carefully versioned.

The most obvious problem this causes for databases like PostgreSQL is that B-tree indexes can become corrupted if they are ordered by a collation that changes. Typical symptoms reported in the wild include rows that cannot be seen by one query but can be seen by another query that is using a different plan. Or sometimes, a unique index fails to prevent a duplicate value from being added to a table.

To see why collation changes could impact whether PostgreSQL is able to look things up, let's use a phone book analogy. After all, a phone book is an ordered index with a lot of similarities to indexes in databases. So, imagine trying to look up your own name in an old-fashioned paper phone book—then imagine the phone book was created by someone who believes the alphabet goes in a slightly different order than you. There's a chance you'll conclude your name is not in the phone book, because your name is ordered in some unexpected place in the book, effectively hidden from you.

For a simple real-world example of a change in sort order on GNU/Linux that involves only ASCII characters, the sort order of the strings 'a-a' and 'a+a' flipped in glibc 2.28 (Debian 10, RHEL 8) compared to earlier releases.

Where does PostgreSQL's collation logic come from?

PostgreSQL relies on external libraries to order strings. There are two "providers":

libc, meaning the operating system locale facility (POSIX or Windows)
icu, meaning the ICU project (if PostgreSQL was built with ICU support)

For now, only libc collations can be used as the database default, though ICU collations can be used explicitly, for example when defining the columns of a table. So which should you use?

One advantage of using libc collations is that the sorting behaviour matches other software on the same operating system.
Conversely, by using ICU it is possible to get software running on different operating systems to agree on sort order, if they're also using ICU.
Another reason to use ICU: some operating systems provide only very rudimentary collation logic in their libc, especially when using the UTF-8 encoding.
Another reason to use ICU is that it supports a rich system of end-user customisation (see some simple examples in the PostgreSQL manual).

While the ICU project's collation rules are based strictly on the Unicode Collation Algorithm and the ICU library version can be matched up with specific versions of Unicode and CLDR, things are a little murkier for libc, depending on your operating system. Many operating systems including FreeBSD, Windows 10, and GNU libc have migrated or are in the process of migrating from their own historical implementations of sorting rules to UCA-based logic and CLDR, so that eventually there will be fewer differences between systems, and between libc and ICU collations. Ironically it is this process that has caused many of the changes that we've seen in the past few years in libc collation definitions on popular operating systems. Even if all implementations eventually converged, CLDR itself is a moving target and it's possible that the versions aren't in sync.

The available set of collations on your system can be seen in the pg_collation catalog. You can add more with the CREATE COLLATION command, but you need to know how to form collcollate strings that are meaningful to your libc (see your operating system documentation) or to ICU as appropriate. Note that collations often have names like "en_US", which requires double quotes when you reference it in a COLLATE expression.

Where does collation version information come from?

The ICU library can report an opaque version for the collations it provides. In PostgreSQL releases 10 to 13, versions of ICU collations were already captured in the pg_collation catalog table when they are first created. Typically that's when the cluster is first set up. Then, when you use the collation (for example, by accessing an index that is ordered by it), a check is performed, and a warning is produced. This would alert you to rebuild all relevant indexes and then run ALTER COLLATION xxx REFRESH VERSION when done. This was a good start but it didn't help many people for several reasons:

it didn't work for libc collations (by far the more popular provider)
it was difficult to see how to extend this model to cover the "default" collation (by far the most popular way to use collations, even if users may be unaware they're doing it!)
it didn't know which indexes needed to be rebuilt, and relied on a human to do that

The first step to making things better and to tracking collation version dependencies was to figure out how to get version information from the operating system for libc collations. No relevant standard provides such an interface, but patches have now been committed to do that for GNU/Linux, FreeBSD 13+, and Windows using non-standard interfaces. Here are some example values, as returned by the pg_collation_actual_version() function in PostgreSQL 14-devel:

OS	Provider	Collation	Version
Linux	libc	en_NZ	2.28
FreeBSD	libc	en_NZ	34.0
Windows	libc	en_NZ	1538.14,1538.14
(All)	icu	en-NZ-x-icu	153.88

Patches for other operating systems are welcome! These version strings are not interpreted in any way by PostgreSQL, except to check if they have changed. Note that on Linux we fall back to the glibc library version, for lack of anything better.

What do we want to do with collations versions?

PostgreSQL tracks the dependencies between database objects in the pg_depend catalog. This is used for various purposes, including stopping you from dropping things that other things reference, dropping dependent things automatically, and dumping objects in the correct order.

The new refobjversion provides a place for us to record the version that was in effect at the time an index was created. For example:

postgres=# create table city (name text primary key);
CREATE TABLE

And now, after inserting some data, suppose we upgrade the operating system, or use a streaming replica server which has a newer operating system version:

postgres=# select * from city where name between 'Aarhus' and 'Antioch';
WARNING:  index "city_pkey" depends on collation "default" version "34.0", but the current version is "36.0"
DETAIL:  The index may be corrupted due to changes in sort order.
HINT:  REINDEX to avoid the risk of corruption.

This is just a warning, and will only be given once per index, per session. The primary way to stop the warning is with the REINDEX command, as mentioned in the HINT. It's also possible to tell the system that you have independently verified that the index has not been corrupted by the change in version with ALTER INDEX name ALTER COLLATION name REFRESH VERSION. One way to verify that is to use the amcheck module.

What else could we do with the new collation version infrastructure?

Currently, warnings advising you to rebuild indexes are raised, but you need to take action one by one to clear the warnings. It would be nice to have a new view and a new REINDEX variant to see and address all problems at once in later work; and the work is underway to do that.

So far I talked about indexes, because PostgreSQL indexes are the most likely source of problems when ordering functions change. Equivalent problems exist in other places, for example:

range partitions might now classify a key as belonging to a different partition after a collation change
check constraints that perform string comparisons might previously have passed, but now fail

While proof-of-concept patches have been drafted to provide similar warnings for those cases, it's a little less clear what the user should do about these warnings, so these proof-of-concept ideas remain under discussion.

Mutating IMMUTABLE functions

PostgreSQL requires every function or operator involved in the definition of an index to be marked IMMUTABLE. The case of mutating collations can be thought of as a violation of that requirement, because the string comparison function (for example, bttextcmp) fails to gaurantee that it'll return the same value given the same inputs for the rest of eternity. The same applies to user-defined functions that are declared IMMUTABLE, but don't meet that requirement due to, say, a dependency on the time, a GUC setting, or any other information that is not derivable only from the inputs.

There is a class of functions that change infrequently and at known times, however. Perhaps it would be possible to create a system of declarative versions for those, using the new refobjversion system: the index would depend on a named version of the function.

Other things that affect indexes

Consider the example of the unaccent function in PostgreSQL. A version for the unaccent() function itself would be a little too simplistic, because there is a second object besides the function that affects its results: the rule file, which can be modified or replaced by the user, either directly or by supplying a pathname as an extra argument to the function.

Another example of functions that are used in indexes but depend on objects that might change is the full text search system. The dictionaries, stopwords and stemmers change over time. In theory, some kind of system could be devised to cover those objects too.

TL;DR

The plan is that—once this feature is committed into a future version of PostgreSQL—PostgreSQL will warn you about collation changes that might otherwise cause subtle, silent corruption such as failure to look up keys or prevent duplicate keys. This has been a persistent source of user complaint over the years; the proposed new warnings for PostgreSQL should help you avoid the risks by alerting you to changes in collation versions, either when upgrading machines or moving databases between machines.

Collations and dependencies are nerdy topics—a bit like leap seconds and time zones: you know, the kinds of obscure things you don't normally have to worry about until something is broken. If you like gory details, the UCA is both fascinating and terrifying (my favourite part is "backward accent ordering" which shows up in some French variants but not others).

All of these things—time zones, leap seconds, collations—are gnarly details of the real world around us. And they can, and do, change. Collations matter to PostgreSQL because people want to be able to use natural language strings in database indexes (and use these indexes to drive user-visible ordering.) Hence, PostgreSQL needs to be able to cope with the fact that the underlying collation rules change over time—and that the providers are external libraries we don't control.

Thanks for reading. I hope you found this deep dive into the proposed collation version tracking feature interesting!

Written by Thomas Munro

Open source software developer focusing on PostgreSQL at Microsoft. Committer on the PostgreSQL and FreeBSD projects. Interested in open hardware, software, protocols, and standards. Speaker at conferences including PGCon, PGConfdev, PGDU, FOSDEM, Citus Con, & keynote speaker at POSETTE: An Event for Postgres 2024.