Citus Blog

Articles tagged: deep dive

In one of our recent releases of the open source Citus extension, we overhauled the way Citus executes distributed SQL queries—with the net effect being some huge improvements in terms of performance, user experience, Postgres compatibility, and resource management. The Citus executor is now able to dynamically adapt to the type of distributed SQL query, ensuring fast response times both for quick index lookups and big analytical queries.

We call this new Citus feature the “adaptive executor” and we thought it would be useful to walk through what the Citus adaptive executor means for Postgres and how it works.

Keep reading

Around 10 years ago I joined Amazon Web Services and that’s where I first saw the importance of trade-offs in distributed systems. In university I had already learned about the trade-offs between consistency and availability (the CAP theorem), but in practice the spectrum goes a lot deeper than that. Any design decision may involve trade-offs between latency, concurrency, scalability, durability, maintainability, functionality, operational simplicity, and other aspects of the system—and those trade-offs have meaningful impact on the features and user experience of the application, and even on the effectiveness of the business itself.

Perhaps the most challenging problem in distributed systems, in which the need for trade-offs is most apparent, is building a distributed database. When applications began to require databases that could scale across many servers, database developers began to make extreme trade-offs. In order to achieve scalability over many nodes, distributed key-value stores (NoSQL) put aside the rich feature set offered by the traditional relational database management systems (RDBMS), including SQL, joins, foreign keys, and ACID guarantees. Since everyone wants scalability, it would only be a matter of time before the RDBMS would disappear, right? Actually, relational databases have continued to dominate the database landscape. And here's why:

Keep reading
Marco Slot

High performance distributed DML in Citus

Written byBy Marco Slot | July 25, 2018Jul 25, 2018

One of the many unique abilities of SQL databases is to transform data using advanced SQL queries and joins in a transactional manner. Commands like UPDATE and DELETE are commonly used for manipulating individual rows, but they become truly powerful when you can use subqueries to determine which rows to modify and how to modify them. It allows you to implement batch processing operations in a thread-safe, transactional, scalable manner.

Citus recently added support for UPDATE/DELETE commands with subqueries that span across all the data. Together with the CTE infrastructure that we’ve introduced over the past few releases, this gives you a new set of powerful distributed data transformation commands. As always, we’ve made sure that queries are executed as quickly and efficiently as possible by spreading out the work to where the data is stored.

Let’s look at an example of how you can use UPDATE/DELETE with subqueries.

Keep reading
Marco Slot

Distributed Execution of Subqueries and CTEs in Citus

Written byBy Marco Slot | March 9, 2018Mar 9, 2018

The latest release of the Citus database brings a number of exciting improvements for analytical queries across all the data and for real-time analytics applications. Citus already offered full SQL support on distributed tables for single-tenant queries and support for advanced subqueries that can be distributed ("pushed down") to the shards. With Citus 7.2, you can also use CTEs (common table expressions), set operations, and most subqueries thanks to a new technique we call "recursive planning".

Keep reading
Joe Kutner

Using Hibernate and Spring to Build Multi-Tenant Java Apps

Written byBy Joe Kutner | February 13, 2018Feb 13, 2018

If you're building a Java app, there's a good chance you're using Hibernate. The Hibernate ORM is a nearly ubiquitous choice for Java developers who need to interact with a relational database. It's mature, widely supported, and feature rich—as demonstrated by its support for multi tenant applications.

Hibernate officially supports two different multi-tenancy mechanisms: separate database and separate schema. Unfortunately, both of these mechanisms come with some downsides in terms of scaling. A third Hibernate multi-tenancy mechanism, a tenant discriminator, also exists, and it’s usable—but it’s still considered a work-in-progress by some. Unlike the separate database and separate schema approaches, which require distinct database connections for each tenant, Hibernate’s tenant discriminator model stores tenant data in a single database and partitions records with either a simple column value or a complex SQL formula.

But fear not, despite the unfinished state of Hibernate's built-in support for a tenant discriminator (or in simple terms tenant_id), it's possible to implement your own discriminator using standard Spring, Hibernate, and AspectJ mechanisms that work quite well. The Hibernate tenant discriminator model works well as you start small on a single-node Postgres database, and even better, tenant discriminator can continue to scale as your data grows by leveraging the Citus extension to Postgres.

Keep reading

Distributed databases often require you to give up SQL and ACID transactions as a trade-off for scale. Citus is a different kind of distributed database. As an extension to PostgreSQL, Citus can leverage PostgreSQL’s internal logic to distribute more sophisticated data models. If you’re building a multi-tenant application, Citus can transparently scale out the underlying database in a way that allows you to keep using advanced SQL queries and transaction blocks.

In multi-tenant applications, most data and queries are specific to a particular tenant. If all tables have a tenant ID column and are distributed by this column, and all queries filter by tenant ID, then Citus supports the full SQL functionality of PostgreSQL—including complex joins and transaction blocks—by transparently delegating each query to the node that stores the tenant’s data. This means that with Citus, you don’t lose any of the functionality or transactional guarantees that you are used to in PostgreSQL, even though your database has been transparently scaled out across many servers. In addition, you can manage your distributed database through parallel DDL, tenant isolation, high performance data loading, and cross-tenant queries.

Keep reading
Marco Slot

Real-time event aggregation at scale using Postgres w/ Citus

Written byBy Marco Slot | November 29, 2016Nov 29, 2016

Citus is commonly used to scale out event data pipelines on top of PostgreSQL. Its ability to transparently shard data and parallelise queries over many machines makes it possible to have real-time responsiveness even with terabytes of data. Users with very high data volumes often store pre-aggregated data to avoid the cost of processing raw data at run-time. With Citus 6.0 this type of workflow became even easier using a new feature that enables pre-aggregation inside the database in a massively parallel fashion using standard SQL. For large datasets, querying pre-computed aggregation tables can be orders of magnitude faster than querying the facts table on demand.

Keep reading
Joe Nelson

Faster PostgreSQL Counting

Written byBy Joe Nelson | October 12, 2016Oct 12, 2016

Everybody counts, but not always quickly. This article is a close look into how PostgreSQL optimizes counting. If you know the tricks there are ways to count rows orders of magnitude faster than you do already.

The problem is actually underdescribed...

Keep reading
Eren Basak

How Distributed Outer Joins on PostgreSQL with Citus Work

Written byBy Eren Basak | October 10, 2016Oct 10, 2016

SQL is a very powerful language for analyzing and reporting against data. At the core of SQL is the idea of joins and how you combine various tables together. One such type of join: outer joins are useful when we need to retain rows, even if it has no match on the other side.

And while the most common type of join, inner join, against tables A and B would bring only the tuples that have a match for both A and B, outer joins give us the ability to bring together from say all of table A even if they don’t have a corresponding match in table B. For example, let's say you keep customers in one table and purchases in another table. When you want to see all purchases of customers, you may want to see all customers in the result even if they did not do any purchases yet. Then, you need an outer join. Within this post we’ll analyze a bit on what outer joins are, and then how we support them in a distributed fashion on Citus.

Keep reading

It may surprise you that pagination, pervasive as it is in web applications, is easy to implement inefficiently. In this article we'll examine several methods of server-side pagination and discuss their tradeoffs when implemented in PostgreSQL. This article will help you identify which technique is appropriate for your situation, including some you may not have seen before which rely on physical clustering and the database stats collector.

Keep reading

Page 2 of 3