The Citus Blog | Citus Data

Database constraints in Postgres: The last line of defense

Written byBy Will Leinweber | March 19, 2018Mar 19, 2018

Data has a certain gravity and inertia. Once it's stored it's not likely to be actively moved or frequently modified. At least not for your one source of truth. Protecting that data and ensuring it's both safely stored but also correct is worth the time investment because of the value it has.

Going further, your database schema and models are going to change far less than your application code. Because it changes less frequently the case can easily be made that spending some time to ensure correctness at the database level is a great return on time.

This post was the result of a recent talk I recently gave at PgDay Paris. The conference itself was a great local event in Paris, and while there we had a chance to meet with a few of our customers based in Paris as well. As it’s always great to get out in person and chat with people about Postgres and their experience in scaling their database, many remarked that the talk could be useful to others that weren’t there. So as I thought it would be worthwhile to write-up, and here you go:

Keep reading

Fun with SQL: generate_series in Postgres

Written byBy Craig Kerstiens | March 14, 2018Mar 14, 2018

There are times within Postgres where you may want to generate sample data or some consistent series of records to join in order for reporting. Enter the simple but handy set returning function of Postgres: generate_series. generate_series as the name implies allows you to generate a set of data starting at some point, ending at another point, and optionally set the incrementing value. generate_series works on two datatypes:

integers
timestamps

Keep reading

Distributed Execution of Subqueries and CTEs in Citus

Written byBy Marco Slot | March 9, 2018Mar 9, 2018

The latest release of the Citus database brings a number of exciting improvements for analytical queries across all the data and for real-time analytics applications. Citus already offered full SQL support on distributed tables for single-tenant queries and support for advanced subqueries that can be distributed ("pushed down") to the shards. With Citus 7.2, you can also use CTEs (common table expressions), set operations, and most subqueries thanks to a new technique we call "recursive planning".

Keep reading

The Postgres 10 feature you didn't know about: CREATE STATISTICS

Written byBy Samay Sharma | March 6, 2018Mar 6, 2018

If you’ve done some performance tuning with Postgres, you might have used EXPLAIN. EXPLAIN shows you the execution plan that the PostgreSQL planner generates for the supplied statement. It shows how the table(s) referenced by the statement will be scanned (using a sequential scan, index scan etc), and what join algorithms will be used if multiple tables are used. But, how does Postgres come up with these plans?

Keep reading

Fun with SQL: Relocating shards on a Citus database cluster

Written byBy Sai Srirampur | February 28, 2018Feb 28, 2018

The Citus extension to Postgres allows you to shard your Postgres database across multiple nodes without having to make major changes to your SaaS application. Citus then provides performance improvements (as compared to single-node Postgres) by transforming SQL queries and distributing queries across multiple nodes, thereby parallelizing the workload. This means that a 2 node, 4 core Citus database cluster could perform 4x faster than single node Postgres.

With the Citus shard rebalancer, you can easily scale your database cluster from 2 nodes to 3 nodes or 4 nodes, with no downtime. You simply run the move shard function on the co-location group you want move shards for, and Citus takes care of the rest. When Citus moves shards, it ensures tables that are co-located stay together. This means all of your joins, say, from orders to order_items still work, just as you'd expect.

Keep reading

When Postgres blocks: 7 tips for dealing with locks

Written byBy Marco Slot | February 22, 2018Feb 22, 2018

Last week I wrote about locking behaviour in Postgres, which commands block each other, and how you can diagnose blocked commands. Of course, after the diagnosis you may also want a cure. With Postgres it is possible to shoot yourself in the foot, but Postgres also offers you a way to stay on target. These are some of the important do’s and don’ts that we’ve seen as helpful when working with users to migrate from their single node Postgres database to Citus or when building new real-time analytics apps on Citus.

Keep reading

Three Approaches to PostgreSQL Replication and Backup

Written byBy Ozgun Erdogan | February 21, 2018Feb 21, 2018

The Citus distributed database scales out PostgreSQL through sharding, replication, and query parallelization. For replication, our database as a service (by default) leverages the streaming replication logic built into Postgres.

When we talk to Citus users, we often hear questions about setting up Postgres high availability (HA) clusters and managing backups. How do you handle replication and machine failures? What challenges do you run into when setting up Postgres HA?

The PostgreSQL database follows a straightforward replication model. In this model, all writes go to a primary node. The primary node then locally applies those changes and propagates them to secondary nodes.

In the context of Postgres, the built-in replication (known as "streaming replication") comes with several challenges:

Postgres replication doesn’t come with built-in monitoring and failover. When the primary node fails, you need to promote a secondary to be the new primary. This promotion needs to happen in a way where clients write to only one primary node, and they don’t observe data inconsistencies.
Many Postgres clients (written in different programming languages) talk to a single endpoint. When the primary node fails, these clients will keep retrying the same IP or DNS name. This makes failover visible to the application.
Postgres replicates its entire state. When you need to construct a new secondary node, the secondary needs to replay the entire history of state change from the primary node. This process is resource intensive—and makes it expensive to kill nodes in the head and bring up new ones.

The first two challenges are well understood. Since the last challenge isn’t as widely recognized, we’ll examine it in this blog post.

Keep reading

PostgreSQL rocks, except when it blocks: Understanding locks

Written byBy Marco Slot | February 15, 2018Feb 15, 2018

On the Citus open source team, we engineers take an active role in helping our users scale out their Postgres database, be it for migrating an existing application or building a new application from scratch. This means we help you with distributing your relational data model—and also with getting the most out of Postgres.

One problem I often see users struggle with when it comes to Postgres is locks. While Postgres is amazing at running multiple operations at the same time, there are a few cases in which Postgres needs to block an operation using a lock. You therefore have to be careful about which locks your transactions take, but with the high-level abstractions that PostgreSQL provides, it can be difficult to know exactly what will happen. This post aims to demystify the locking behaviors in Postgres, and to give advice on how to avoid common problems.

Keep reading

Using Hibernate and Spring to Build Multi-Tenant Java Apps

Written byBy Joe Kutner | February 13, 2018Feb 13, 2018

If you're building a Java app, there's a good chance you're using Hibernate. The Hibernate ORM is a nearly ubiquitous choice for Java developers who need to interact with a relational database. It's mature, widely supported, and feature rich—as demonstrated by its support for multi tenant applications.

Hibernate officially supports two different multi-tenancy mechanisms: separate database and separate schema. Unfortunately, both of these mechanisms come with some downsides in terms of scaling. A third Hibernate multi-tenancy mechanism, a tenant discriminator, also exists, and it’s usable—but it’s still considered a work-in-progress by some. Unlike the separate database and separate schema approaches, which require distinct database connections for each tenant, Hibernate’s tenant discriminator model stores tenant data in a single database and partitions records with either a simple column value or a complex SQL formula.

But fear not, despite the unfinished state of Hibernate's built-in support for a tenant discriminator (or in simple terms tenant_id), it's possible to implement your own discriminator using standard Spring, Hibernate, and AspectJ mechanisms that work quite well. The Hibernate tenant discriminator model works well as you start small on a single-node Postgres database, and even better, tenant discriminator can continue to scale as your data grows by leveraging the Citus extension to Postgres.

Keep reading

Distributing Postgres with a Distributed Team at Citus Data

Written byBy Sumedh Pathak | February 3, 2018Feb 3, 2018

Over the last two years, our engineering team at Citus Data has shortened release cycles from 12 months all the way down to 8 weeks. The most recent 7.2 release of the Citus database took 8 weeks exactly, start to finish.

These shortened release cycles have been chock full of new capabilities for our users, including distributed deadlock detection in Citus 7.0, multi-shard updates and deletes in Citus 7.1, and support for CTE’s (common table expressions) and complex Postgres subqueries in Citus 7.2.

On the Citus Cloud side (that’s our fully-managed database as a service that runs on AWS), we’ve recently added fork, followers, fully-online "warp" migration from existing PostgreSQL installations, and point-in-time-recovery (PITR), just to name a few.

When I step back to think about how we got here (as a co-founder of Citus Data, I’ve been here since the beginning), it’s no surprise that I attribute much of what we’ve accomplished to our team. But here’s the point about all these accomplishments that I think is so interesting: our engineering team is distributed across 5 countries and 6 different cities.

Keep reading