Citus 10.2 is out! 10.2 brings you new columnar & time series features—and is ready to support Postgres 14. Read the new Citus 10.2 blog.

Skip navigation
Marco Slot

Marco Slot

CITUS BLOG AUTHOR PROFILE

Lead engineer on the Citus engine team at Microsoft. Speaker at Postgres Conf EU, PostgresOpen, pgDay Paris, Hello World, SIGMOD, & lots of meetups. PhD in distributed systems. Loves mountain hiking.

@marcoslot marcocitus

PUBLISHED ARTICLES
Marco Slot

Scalable incremental data aggregation on Postgres and Citus

Written by By Marco Slot | June 14, 2018 Jun 14, 2018

Many companies generate large volumes of time series data from events happening in their application. It’s often useful to have a real-time analytics dashboard to spot trends and changes as they happen. You can build a real-time analytics dashboard on Postgres by constructing a simple pipeline:

  1. Load events into a raw data table in batches
  2. Periodically aggregate new events into a rollup table
  3. Select from the rollup table in the dashboard

For large data streams, Citus (an open source extension to Postgres that scales out Postgres horizontally) can scale out each of these steps across all the cores in a cluster of Postgres nodes.

One of the challenges of maintaining a rollup table is tracking which events have already been aggregated—so you can make sure that each event is aggregated exactly once. A common technique to ensure exactly-once aggregation is to run the aggregation for a particular time period after that time period is over. We often recommend aggregating at the end of the time period for its simplicity, but you cannot provide any results before the time period is over and backfilling is complicated.

Keep reading
Marco Slot

Distributed Execution of Subqueries and CTEs in Citus

Written by By Marco Slot | March 9, 2018 Mar 9, 2018

The latest release of the Citus database brings a number of exciting improvements for analytical queries across all the data and for real-time analytics applications. Citus already offered full SQL support on distributed tables for single-tenant queries and support for advanced subqueries that can be distributed (“pushed down”) to the shards. With Citus 7.2, you can also use CTEs (common table expressions), set operations, and most subqueries thanks to a new technique we call “recursive planning”.

Keep reading
Marco Slot

When Postgres blocks: 7 tips for dealing with locks

Written by By Marco Slot | February 22, 2018 Feb 22, 2018

Last week I wrote about locking behaviour in Postgres, which commands block each other, and how you can diagnose blocked commands. Of course, after the diagnosis you may also want a cure. With Postgres it is possible to shoot yourself in the foot, but Postgres also offers you a way to stay on target. These are some of the important do’s and don’ts that we’ve seen as helpful when working with users to migrate from their single node Postgres database to Citus or when building new real-time analytics apps on Citus.

Keep reading
Marco Slot

PostgreSQL rocks, except when it blocks: Understanding locks

Written by By Marco Slot | February 15, 2018 Feb 15, 2018

On the Citus open source team, we engineers take an active role in helping our users scale out their Postgres database, be it for migrating an existing application or building a new application from scratch. This means we help you with distributing your relational data model—and also with getting the most out of Postgres.

One problem I often see users struggle with when it comes to Postgres is locks. While Postgres is amazing at running multiple operations at the same time, there are a few cases in which Postgres needs to block an operation using a lock. You therefore have to be careful about which locks your transactions take, but with the high-level abstractions that PostgreSQL provides, it can be difficult to know exactly what will happen. This post aims to demystify the locking behaviors in Postgres, and to give advice on how to avoid common problems.

Keep reading
Marco Slot

Podyn: DynamoDB to PostgreSQL replication and migration tool

Written by By Marco Slot | September 22, 2017 Sep 22, 2017

Wouldn’t it be great if you could run SQL queries on your data in DynamoDB? While this isn’t possible directly, there is an alternative: With Podyn, you can automatically replicate the schema, data, and changes in your DynamoDB tables to Postgres. Once your data is flowing into Postgres, you can start using a wide array of features including views, indexes, rollup tables, and advanced SQL queries. And if you chose Dynamo because you needed scale beyond what you thought a single Postgres instance could provide, you can replicate it into Citus.

Keep reading
Marco Slot

Databases and Distributed Deadlocks: A FAQ

Written by By Marco Slot | August 31, 2017 Aug 31, 2017

Since Citus is a distributed database, we often hear questions about distributed transactions. Specifically, people ask us about transactions that modify data living on different machines.

So we started to work on distributed transactions. We then identified distributed deadlock detection as the building block to enable distributed transactions in Citus.

First some background: At Citus we focus on scaling out Postgres. We want to make Postgres performance & Postgres scale something you never have to worry about. We even have a cloud offering, a fully-managed database as a service, to make Citus even more worry-free. We carry the pager so you don’t have to and all that. And because we’ve built Citus using the PostgreSQL extension APIs, Citus stays in sync with all the latest Postgres innovations as they are released (aka we are not a fork.) Yes, we’re excited for Postgres 10 like all the rest of you :)

Back to distributed deadlocks: As we began working on distributed deadlock detection, we realized that we needed to clarify certain concepts. So we created a simple FAQ for the Citus development team. And we found ourselves referring back to the FAQ over and over again. So we decided to share it here on our blog, in the hopes you find it useful.

Keep reading
Marco Slot

Scaling out complex SQL transactions in multi-tenant apps on Postgres

Written by By Marco Slot | June 2, 2017 Jun 2, 2017

Distributed databases often require you to give up SQL and ACID transactions as a trade-off for scale. Citus is a different kind of distributed database. As an extension to PostgreSQL, Citus can leverage PostgreSQL’s internal logic to distribute more sophisticated data models. If you’re building a multi-tenant application, Citus can transparently scale out the underlying database in a way that allows you to keep using advanced SQL queries and transaction blocks.

In multi-tenant applications, most data and queries are specific to a particular tenant. If all tables have a tenant ID column and are distributed by this column, and all queries filter by tenant ID, then Citus supports the full SQL functionality of PostgreSQL—including complex joins and transaction blocks—by transparently delegating each query to the node that stores the tenant’s data. This means that with Citus, you don’t lose any of the functionality or transactional guarantees that you are used to in PostgreSQL, even though your database has been transparently scaled out across many servers. In addition, you can manage your distributed database through parallel DDL, tenant isolation, high performance data loading, and cross-tenant queries.

Keep reading
Marco Slot

Postgres Parallel indexing in Citus

Written by By Marco Slot | January 17, 2017 Jan 17, 2017

Indexes are an essential tool for optimizing database performance and are becoming ever more important with big data. However, as the volume of data increases, index maintenance often becomes a write bottleneck, especially for advanced index types which use a lot of CPU time for every row that gets written. Index creation may also become prohibitively expensive as it may take hours or even days to build a new index on terabytes of data in postgres. As of Citus 6.0, we’ve made creating and maintaining indexes that much faster through parallelization.

Keep reading
Marco Slot

Scaling out relational data models, and SQL, through co-location

Written by By Marco Slot | December 22, 2016 Dec 22, 2016

Relational databases are the first choice of data store for many applications due to their enormous flexibility and reliability. Historically the one knock against relational databases is that they can only run on a single machine, which creates inherent...

Keep reading
Marco Slot

Real-time event aggregation at scale using Postgres w/ Citus

Written by By Marco Slot | November 29, 2016 Nov 29, 2016

Citus is commonly used to scale out event data pipelines on top of PostgreSQL. Its ability to transparently shard data and parallelise queries over many machines makes it possible to have real-time responsiveness even with terabytes of data. Users with very high data volumes often store pre-aggregated data to avoid the cost of processing raw data at run-time. With Citus 6.0 this type of workflow became even easier using a new feature that enables pre-aggregation inside the database in a massively parallel fashion using standard SQL. For large datasets, querying pre-computed aggregation tables can be orders of magnitude faster than querying the facts table on demand.

Keep reading

Page 2 of 5