Marco Slot

Marco Slot

CITUS BLOG AUTHOR PROFILE

Former lead engineer for the Citus database engine at Microsoft. Speaker at Postgres Conf EU, PostgresOpen, pgDay Paris, Hello World, SIGMOD, & lots of meetups. Talk selection team member for Citus Con: An Event for Postgres. PhD in distributed systems. Loves mountain hiking.

@marcoslot marcocitus

PUBLISHED ARTICLES
Marco Slot

Making Postgres stored procedures 9X faster in Citus

Written by By Marco Slot | November 21, 2020 Nov 21, 2020

Stored procedures are widely used in commercial relational databases. You write most of your application logic in PL/SQL and achieve notable performance gains by pushing this logic into the database. As a result, customers who are looking to migrate from other databases to PostgreSQL usually make heavy use of stored procedures.

When migrating from a large database, using the Citus extension to distribute your database can be an attractive option, because you will always have enough hardware capacity to power your workload. The Hyperscale (Citus) option in Azure Database for PostgreSQL makes it easy to get a managed Citus cluster in minutes.

In the past, customers who migrated stored procedures to Citus often reported poor performance because each statement in the procedure involved an extra network round trip between the Citus coordinator node and the worker nodes. We also observed this ourselves when we evaluated Citus performance using the TPC-C-based workload in HammerDB (TPROC-C), which is implemented using stored procedures.

Keep reading
Marco Slot

Evolving pg_cron together: Postgres 13, audit log, background workers, & job names

Written by By Marco Slot | October 31, 2020 Oct 31, 2020

One of the unique things about Postgres is that it is highly programmable via PL/pgSQL and extensions. Postgres is so programmable that I often think of Postgres as a computing platform rather than just a database (or a distributed computing platform—with Citus). As a computing platform, I always felt that Postgres should be able to take actions in an automated way. That is why I created the open source pg_cron extension back in 2016 to run periodic jobs in Postgres—and why I continue to maintain pg_cron now that I work on the Postgres team at Microsoft.

Using pg_cron, you can schedule Postgres queries to run periodically, according to the familiar cron syntax. Some typical examples:

Keep reading
Marco Slot

Talking about Citus & Postgres at any scale

Written by By Marco Slot | September 17, 2020 Sep 17, 2020

I recently gave a talk about the Citus extension to Postgres at the Warsaw PostgreSQL Users Group. Unfortunately, I did not get to go in person to beautiful Warsaw, but it was still a nice way to interact with the global Postgres community and talk about what Citus is, how it works, and what it can do for you.

Keep reading
Marco Slot

What’s new in the Citus 9.4 extension to Postgres

Written by By Marco Slot | September 5, 2020 Sep 5, 2020

Our latest release to the Citus extension to Postgres is Citus 9.4. If you’re not yet familiar, Citus transforms Postgres into a distributed database, distributing your data and your SQL queries across multiple nodes. This post is basically the Citus 9.4 release notes.

If you’re ready to get started with Citus, it’s easy to download Citus open source packages for 9.4.

I always recommend people check out docs.citusdata.com to learn more. The Citus documentation has rigorous tutorials, details on every Citus feature, explanations of key concepts—things like choosing the distribution column—tutorials on how you can set up Citus locally on a single server, how to install Citus on multiple servers, how to build a real-time analytics dashboard, how to build a multi-tenant database, and more...

Keep reading
Marco Slot

How the Citus distributed query executor adapts to your Postgres workload

Written by By Marco Slot | April 27, 2020 Apr 27, 2020

In one of our recent releases of the open source Citus extension, we overhauled the way Citus executes distributed SQL queries—with the net effect being some huge improvements in terms of performance, user experience, Postgres compatibility, and resource management. The Citus executor is now able to dynamically adapt to the type of distributed SQL query, ensuring fast response times both for quick index lookups and big analytical queries.

We call this new Citus feature the “adaptive executor” and we thought it would be useful to walk through what the Citus adaptive executor means for Postgres and how it works.

Keep reading
Marco Slot

Citus 9.2 speeds up large scale HTAP workloads on Postgres

Written by By Marco Slot | March 2, 2020 Mar 2, 2020

Some of you have been asking, “what’s happening with the Citus open source extension to Postgres?” The short answer is: a lot. More and more users have adopted the Citus extension in order to scale out Postgres, to increase performance and enable growth. And you’re probably not surprised to learn that since Microsoft acquired Citus Data last year, our engineering team has grown quite a bit—and we’ve been continuing to evolve and innovate on the Citus open source extension.

Our newest release is Citus 9.2. We’ve updated the installation instructions on our Download page and in our Citus documentation, and now it’s time to take a walk through what’s new.

Keep reading
Marco Slot

Why the RDBMS is the future of distributed databases, ft. Postgres and Citus

Written by By Marco Slot | November 30, 2018 Nov 30, 2018

Around 10 years ago I joined Amazon Web Services and that’s where I first saw the importance of trade-offs in distributed systems. In university I had already learned about the trade-offs between consistency and availability (the CAP theorem), but in practice the spectrum goes a lot deeper than that. Any design decision may involve trade-offs between latency, concurrency, scalability, durability, maintainability, functionality, operational simplicity, and other aspects of the system—and those trade-offs have meaningful impact on the features and user experience of the application, and even on the effectiveness of the business itself.

Perhaps the most challenging problem in distributed systems, in which the need for trade-offs is most apparent, is building a distributed database. When applications began to require databases that could scale across many servers, database developers began to make extreme trade-offs. In order to achieve scalability over many nodes, distributed key-value stores (NoSQL) put aside the rich feature set offered by the traditional relational database management systems (RDBMS), including SQL, joins, foreign keys, and ACID guarantees. Since everyone wants scalability, it would only be a matter of time before the RDBMS would disappear, right? Actually, relational databases have continued to dominate the database landscape. And here's why:

Keep reading
Marco Slot

High performance distributed DML in Citus

Written by By Marco Slot | July 25, 2018 Jul 25, 2018

One of the many unique abilities of SQL databases is to transform data using advanced SQL queries and joins in a transactional manner. Commands like UPDATE and DELETE are commonly used for manipulating individual rows, but they become truly powerful when you can use subqueries to determine which rows to modify and how to modify them. It allows you to implement batch processing operations in a thread-safe, transactional, scalable manner.

Citus recently added support for UPDATE/DELETE commands with subqueries that span across all the data. Together with the CTE infrastructure that we’ve introduced over the past few releases, this gives you a new set of powerful distributed data transformation commands. As always, we’ve made sure that queries are executed as quickly and efficiently as possible by spreading out the work to where the data is stored.

Let’s look at an example of how you can use UPDATE/DELETE with subqueries.

Keep reading
Marco Slot

Scalable incremental data aggregation on Postgres and Citus

Written by By Marco Slot | June 14, 2018 Jun 14, 2018

Many companies generate large volumes of time series data from events happening in their application. It’s often useful to have a real-time analytics dashboard to spot trends and changes as they happen. You can build a real-time analytics dashboard on Postgres by constructing a simple pipeline:

  1. Load events into a raw data table in batches
  2. Periodically aggregate new events into a rollup table
  3. Select from the rollup table in the dashboard

For large data streams, Citus (an open source extension to Postgres that scales out Postgres horizontally) can scale out each of these steps across all the cores in a cluster of Postgres nodes.

One of the challenges of maintaining a rollup table is tracking which events have already been aggregated—so you can make sure that each event is aggregated exactly once. A common technique to ensure exactly-once aggregation is to run the aggregation for a particular time period after that time period is over. We often recommend aggregating at the end of the time period for its simplicity, but you cannot provide any results before the time period is over and backfilling is complicated.

Keep reading
Marco Slot

Distributed Execution of Subqueries and CTEs in Citus

Written by By Marco Slot | March 9, 2018 Mar 9, 2018

The latest release of the Citus database brings a number of exciting improvements for analytical queries across all the data and for real-time analytics applications. Citus already offered full SQL support on distributed tables for single-tenant queries and support for advanced subqueries that can be distributed ("pushed down") to the shards. With Citus 7.2, you can also use CTEs (common table expressions), set operations, and most subqueries thanks to a new technique we call "recursive planning".

Keep reading

Page 2 of 6