Citus 10.2 is out! If you are not yet familiar with Citus, it is an open source extension to Postgres that transforms Postgres into a distributed database—so you can achieve high performance at any scale. The Citus open source packages are available for download. And Citus is also available in the cloud as a managed service, too.
You can see a bulleted list of all the changes in the CHANGELOG on GitHub. This post is your guide to what’s new in Citus 10.2, including some of these headline features.
[UPDATE in Sep 2021]: This blog post was originally written during the PostgreSQL 14 development cycle. The feature discussed is now a candidate for PostgreSQL 15 and the text has been updated to reflect this.
As part of my work on the open source PostgreSQL team at Microsoft, I've been developing a new feature for PostgreSQL to track dependencies on collation versions, with help from co-author Julien Rouhaud and many others who have contributed ideas. It's taken a long time to build a consensus on how to tackle this thorny problem (work I began at EnterpriseDB and continued at Microsoft), and you can read about some of the details and considerations in the commit message below and the referenced discussion thread. We're not quite done with that yet. It was originally planned for PostgreSQL 14, but some unhandled complications arose so this project is back in the workshop.
Author: Thomas Munro <email@example.com>
Date: Mon Nov 2 19:50:45 2020 +1300
Track collation versions for indexes.
Record the current version of dependent collations in pg_depend when
creating or rebuilding an index. When accessing the index later, warn
that the index may be corrupted if the current version doesn't match.
Thanks to Douglas Doole, Peter Eisentraut, Christoph Berg, Laurenz Albe,
Michael Paquier, Robert Haas, Tom Lane and others for very helpful
Author: Thomas Munro <firstname.lastname@example.org>
Author: Julien Rouhaud <email@example.com>
Reviewed-by: Peter Eisentraut <firstname.lastname@example.org> (earlier versions)
In this article I'll talk about the problem we need to solve—that PostgreSQL indexes can get corrupted by changes in collations that occur naturally over time—and how the new feature will make things better in a future version of PostgreSQL. Plus, you’ll get a bit of background on collations, too.
When Django developer and Azure Postgres* engineer Louise Grandjonc confirmed that she could sit down with me for an interview in the days leading up to DjangoCon 2019, I jumped at the chance. Those of you who were in the room for Louise’s talk this week probably understand why. Louise explains technical topics in a way that makes sense—and she often uses unusual (and fun) examples, from crocodiles to owls, from Harry Potter to Taylor Swift.
The twelve-factor app changed the way we build SaaS applications. Explicit dependency management, separating config from code, scaling out your app concurrently—these design principles took us from giant J2EE apps to apps that scale predictably on the web. One of these 12 factors has long stood out as a challenge when it comes to databases: dev/prod parity. Sure, you can run the exact same version of your database, and have a sandbox copy, but testing and staging with production data... that's a different story.
Written byByMarco Slot | June 14, 2018Jun 14, 2018
Many companies generate large volumes of time series data from events happening in their application. It’s often useful to have a real-time analytics dashboard to spot trends and changes as they happen. You can build a real-time analytics dashboard on Postgres by constructing a simple pipeline:
Load events into a raw data table in batches
Periodically aggregate new events into a rollup table
Select from the rollup table in the dashboard
For large data streams, Citus (an open source extension to Postgres that scales out Postgres horizontally) can scale out each of these steps across all the cores in a cluster of Postgres nodes.
One of the challenges of maintaining a rollup table is tracking which events have already been aggregated—so you can make sure that each event is aggregated exactly once. A common technique to ensure exactly-once aggregation is to run the aggregation for a particular time period after that time period is over. We often recommend aggregating at the end of the time period for its simplicity, but you cannot provide any results before the time period is over and backfilling is complicated.
At Citus we spend a lot of time working with customers on data modeling, optimizing queries, and adding indexes to make things snappy. My goal is to be as available for our customers as we need to be, in order to make you successful. Part of that is keeping your Citus cluster well tuned and performant which we take care of for you. Another part is helping you with everything you need to know about Postgres and Citus. After all a healthy and performant database means a fast performing app and who wouldn’t want that. Today we’re going to condense some of the information we’ve shared directly with customers about Postgres indexes.
Postgres has a number of index types, and with each new release seems to come with another new index type. Each of these indexes can be useful, but which one to use depends on 1. the data type and then sometimes 2. the underlying data within the table, and 3. the types of lookups performed. In what follows we'll look at a quick survey of the index types available to you in Postgres and when you should leverage each. Before we dig in, here’s a quick glimpse of the indexes we’ll walk you through:
Postgres indexes make your application fast. And while one option is to analyze each of your relational database queries with pg_stat_statements to see where you should add indexes... an alternative fix (and a quick one at that) could be to add indexes to each and every database table—and every column—within your database. To make this easy for you, here's a query you can run that will create the CREATE INDEX commands for every table and column in your Postgres database.
Disclaimer: Adding an index to every column of every table is not a best practice for production databases. And it’s certainly not something we recommend at Citus Data, where we take scaling out Postgres and database performance very seriously. But indexing all the things is a fun “what if” thought exercise that is well worth thinking about, to understand the value of indexes in Postgres. What if you indexed all the things?
With that disclaimer out of the way, onto my advice of how to index all the things as a way to understand your database’s performance.
Written byByMarco Slot | January 17, 2017Jan 17, 2017
Indexes are an essential tool for optimizing database performance and are becoming ever more important with big data. However, as the volume of data increases, index maintenance often becomes a write bottleneck, especially for advanced index types which use a lot of CPU time for every row that gets written. Index creation may also become prohibitively expensive as it may take hours or even days to build a new index on terabytes of data in postgres. As of Citus 6.0, we’ve made creating and maintaining indexes that much faster through parallelization.
Written byByMarco Slot | April 28, 2016Apr 28, 2016
Product search is a common, yet sometimes challenging use-case for online retailers and marketplaces. It typically involves a combination of full-text search and filtering by attributes which differ for every product category. More complex use-cases may have many sellers that offer the same product, but with a different price and different properties.