Mining for logic bugs in the Citus extension to Postgres with SQLancer

This post by Microsoft engineering intern Nazli Ugur Koyluoglu on her work with Citus and SQLancer was originally published on the Azure Database for PostgreSQL Blog.

Update in October 2022: Citus has a new home on Azure! The Citus database is now available as a managed service in the cloud as Azure Cosmos DB for PostgreSQL. Azure documentation links have been updated throughout the post, to point to the new Azure docs.

Many of you rely on databases to return correct results for your SQL queries, however complex your queries might be. And you probably place your trust with no questions asked—since you know relational databases are built on top of proven mathematical foundations, and since there is no practical way to manually verify your SQL query output anyway.

Since it is possible that a database’s implementation of the SQL logic could have a few errors, database developers apply extensive testing methods to avoid such flaws. For instance, the Citus open source repo on GitHub has more than twice as many lines related to automated testing than lines of database code. However, checking correctness for all possible SQL queries is challenging because of the lack of a "ground truth" to compare their outputs against, and the infinite number of possible SQL queries.

Even when we do know the result a SQL query is supposed to give and we get that result as expected, that does not mean the implementation is 100% correct in all possible variants of the query and input data. The challenge for every database team is in finding subtle edge cases where bugs might be introduced.

What are logic bugs in SQL, and why are they hard to detect?

There are 2 primary types of errors in databases:

"raised" errors
logic errors

"Raised" errors include syntax errors, panics, faults, and other crashes. Raised errors terminate the process abnormally or give you some signal of undesired behavior, making them immediately recognizable to you. In order to automatically detect (and then fix) raised errors, database providers widely use fuzzing tools for SQL. Fuzzing tools are random SQL query generators that strain the target database with complex commands until an error is raised.

Logic errors, on the other hand, are latent (or silent) errors that cause the database to produce inaccurate results without emitting any sign of unexpected behavior. Logic bugs are dangerous—neither the user, nor the database provider, nor the fuzzing tools might be aware of incorrect results being fetched from the database.

raised errors vs logic errors in databases — Fig. 1: "Raised" errors vs. logic errors

What if we could find a way to test the validity of SQL query output for a DBMS?

The recently launched open source SQLancer (Synthesized Query Lancer) tool gives you a way to test the validity of a database’s query responses. SQLancer is an automated Database Management System (DBMS) testing tool for detecting logic bugs. SQLancer’s testing approaches substitute the need for a ground truth in their correctness checks, by probing databases for self-consistency rather than agreement with some known behavior.

While SQLancer is still in the research prototype stage, it was released early on due to high demand from different companies, organizations, and developers. I first learned about SQLancer in June 2020 through a Quarantine 2020 Database Talk organized by Carnegie Mellon Professor of Databases Andy Pavlo.

The CMU talk by SQLancer creator Manuel Rigger was about Finding Logic Bugs in Database Management Systems. At the time of Manuel Rigger’s talk, SQLancer had already identified more than 175 bugs in 5 different SQL database management systems by using Ternary Logic Partitioning (TLP). Hence, Ternary Logic Partitioning became the SQLancer testing approach that was of most interest to our team at Citus.

If you are not familiar, Citus is an open source extension to Postgres that transforms Postgres into a distributed database. Since the Citus team was acquired by Microsoft last year, Citus is now available on Microsoft Azure as Hyperscale (Citus), a built-in deployment option on Azure Database for PostgreSQL.

During my software engineering internship the summer after my second year as an undergraduate at Stanford, I worked on the Citus open source team at Microsoft. My project was about incorporating correctness checks into our automated testing mechanisms, which led me to the CMU database talk on SQLancer. Keep reading to find out what excited us about SQLancer, how I developed an implementation of SQLancer that supports the Citus extension to Postgres, and the rewarding outcomes of my project.

SQLancer’s performance on Citus (& Postgres)

SQLancer proved itself useful for detecting logic bugs in Citus early on in my project, with the first error found within seconds of our initial SQLancer run after I had just finished implementing basic Citus support. And I found the first SQL logic bug within the first week.

By the end of my summer internship on the Postgres team at Microsoft, we had opened more than 10 issues in the Citus GitHub page for errors found by SQLancer, at least 3 of which qualify as SQL logic bugs. Moreover, I was also able to identify an error in PostgreSQL, which was then fixed by David Rowley and referenced by Thomas Munro—both of whom are PostgreSQL committers and part of the Postgres team at Microsoft.

Thanks @sqlancer_dbms!https://t.co/ZS4JostgzZ
— Thomas Munro (@MengTangmu) July 14, 2020

With the Citus implementation of SQLancer, we were able to test more than 3 million SQL queries in 3 days. While logic bugs in SQL databases are rare, the speed of the SQLancer tool and its broad, complex test space allowed us to detect logic errors that exist in the DBMS implementation (or that might be introduced in the future with the addition of new features). Whether part of Continuous Integration (CI) tests or part of regular lengthy background runs, incorporating SQLancer into the automated testing procedures for the Citus extension to PostgreSQL will significantly improve the reliability and stability of Citus for our users.

A technical overview of SQLancer’s testing approach: Ternary Logic Partitioning (TLP)

Among three of the test oracles adopted by SQLancer, my project focused on the Ternary Logic Partitioning (TLP) approach, which can be used to test WHERE, GROUP BY, HAVING clauses, aggregate functions, and DISTINCT queries. TLP compares the result sets of two semantically equivalent yet syntactically different SELECT queries that are expected to be equal.

Let’s make an apples analogy: Say, you want to fetch all apples. You could also say that you want to fetch all apples that are red, fetch all apples that are not red, or fetch all apples where the color is unknown. Both versions would be equivalent—the second one is only a more verbose way of saying the same thing by adding conditions related to "redness" and combining them in a way that makes the condition trivial.

For those of you who want to see the mathematical explanation for how TLP works, rather than rely on my apple analogy, here it goes:

Let Q be the original query. The semantically equivalent query Q′ would be a UNION of 3 queries that partition Q using ternary logic. These 3 queries are generated by randomly generating a boolean predicate φ, and appending a version of φ evaluated to TRUE, FALSE, and NULL to the end of Q, each in the form of a WHERE or HAVING clause. Since these 3 boolean values cover the universal set of states associated with the predicate, their union, Q′, must return the same result set as that returned by the query without the predicate, Q, i.e. ResultSet(Q) = ResultSet(Q′).

implementation steps for ternary logic — Fig 3. Implementation steps for Ternary Logic Partitioning, inspired by Manuel Rigger’s SQLancer talk at the CMU Quarantine 2020 Database talks: https://youtu.be/_LLHssTadKA

The steps for implementing the test oracle are as follows:

Randomly generate the original query Q, in a similar fashion to random queries generated by fuzzers. To continue our apple analogy, Q would be fetch all apples.
Randomly generate the boolean predicate φ. φ corresponds to the condition in our analogy, to be red.
Create the semantically equivalent query Q′ using the Ternary Logic Partitioning (TLP) approach. In our analogy, Q′ corresponds to the verbose version of the command, to fetch apples that are red, that are not red, or where the color is unknown.
Fetch and compare result sets for Q and Q′.
Report a logic bug if there is a mismatch.

Ternary Logic Partitioning in SQLancer allows us to test a database like Postgres or Citus against itself, first by generating a (relatively) simple and a more complex version of a SQL query, and then by checking whether the added complexity introduces a logic bug. In other words, SQLancer’s TLP test oracle eliminates the need for a ground truth by designating the simpler SQL query as a heuristic for the expected output of the more complex query.

While there’s no guarantee that we can detect all logic bugs in a database like Postgres or Citus using query partitioning, SQLancer’s TLP has proven to be successful with the Citus extension to Postgres and with other databases as well.

Why care about the logic bugs SQLancer finds?

You might think, if SQLancer is testing super complex SQL queries that developers wouldn’t usually come up with (let alone users), why are these SQL queries important? Good question.

Well, while the logic errors are revealed by complex machine-generated SQL queries, it’s possible that the logic bugs themselves might lie in integral operations that affect behavior generally. For the Citus extension to Postgres in particular, the source of logic bugs has mostly been the parsing and deparsing of SQL queries during pushdown to the Citus worker nodes.

Another use case that highlights the usefulness of SQLancer: many of you probably use object-relational mappers (ORMs) to generate SQL queries rather than manually writing them, which makes it more likely that your computer-generated queries might wander into the error-prone zone tested by SQLancer.

Logic bug detection in Citus, at a glance

Running SQLancer to find logic bugs in your SQL involves 2 different phases of execution:

The preparation phase. SQLancer initializes the test database with Citus support, creates local, distributed, and reference tables inside the database, and performs insertions, updates, deletes, and other modifications to the tables via the randomly generated SQL commands.
The testing phase. SQLancer generates, executes, and compares the outputs of pairs of original and partitioned SELECT queries, in line with the TLP approach—until it finds a mismatch, i.e. logic bug.

In Figure 4 above, you can see the original SQL query and the partitioned query—a union of the 3 components with the predicates evaluated to TRUE, FALSE, and NULL, respectively—that failed to return identical result sets. As it turned out, Citus had a bug in its handling of explicit CROSS JOIN syntax when used in a complex join tree. Since cross joins are more commonly used in the implicit form (FROM t1, t4), this type of join tree had not come up in manual testing.

How I implemented SQLancer for the Citus extension to Postgres

I created the Citus implementation of SQLancer by reusing and extending components of the existing SQLancer support for PostgreSQL.

Programming a SQLancer implementation for Citus involved incorporating sharding across multiple nodes in the creation of test databases by SQLancer, expanding the test space by integrating Citus concepts and features such as distributed and reference tables, colocation, distribution key etc. into the database preparation phase, configuring the different types of JOINs supported by Citus in the SELECT statements generated during the testing phase, and adjusting the expected behavior to reflect Citus’ SQL coverage.

Sweet, @nazli_koyluoglu contributed the 1000th commit for @sqlancer_dbms. 💥 During Nazli’s some internship in the Postgres team here at Microsoft, Nazli used SQLancer to find logic bugs in the @citusdata extension to Postgres, and in @PostgreSQL too! HT @ThaNODnl @MengTangmu https://t.co/TigcqQ4ZNp
— Azure Database for PostgreSQL (@AzureDBPostgres) August 12, 2020

A potential positive side effect of my project to the Postgres community: Building Citus support into SQLancer on top of the existing Postgres support required reworking the Postgres support to prepare its Java classes to be extended by new implementations. The improvements and modifications I contributed to the Postgres implementation of SQLancer have thus paved the way for future SQLancer applications for other PostgreSQL extensions , as well.

Special Thanks

I had an incredible experience working on the Citus extension to Postgres during my summer internship in the Postgres team at Microsoft. I had the chance to join the Citus open source project and work with some pretty amazing people. I would like to thank my mentor, Nils Dijk, for his commitment to my growth, especially in a remote setting with limitations imposed by COVID-19. I would like to thank my manager, Utku Azman, for his constant encouragement to take initiative.

Thank you to Onder Kalaci and Marco Slot, whose contributions to the vision for our implementation of SQLancer helped tailor my project to make my work even more useful for the Citus team. And I want to thank the Citus team as a whole for welcoming me with their unsparing help.

Special thanks to Manuel Rigger, the creator of SQLancer, for sharing our excitement about the development of a SQLancer implementation for Citus and his willingness to collaborate.

.@nazli_koyluoglu just contributed a Citus implementation for SQLancer (see https://t.co/mrKnKCLeNx). Previously, she already contributed PostgreSQL implementation enhancements. Thank you for your awesome work! Check out the bugs she has found in Citus at https://t.co/BsMDExp3W9
— SQLancer (@sqlancer_dbms) August 11, 2020

The @citusdata team—now part of @Microsoft—has used SQLancer to find bugs in the Citus extension to PostgreSQL, an open source extension which transforms Postgres into a distributed database. https://t.co/27hSSzCtf4
— SQLancer (@sqlancer_dbms) August 11, 2020

Written by Nazli Ugur Koyluoglu

Summer 2020 intern on the Postgres engineering team at Microsoft working on the open source Citus database engine. Loves physics.