Citus 11 is out! Now 100% open source. Read all about it in Marco’s release blog. 💥
For roughly ten years now, I’ve had the pleasure of running and managing databases for people. In the early stages of building an application you move quickly, adding new tables and columns to your Postgres database to support new functionality. You move quickly, but you don’t worry too much because things are fast and responsive–largely because your data is small. Over time your application grows and matures. Your data model stabilizes, and you start to spend more time tuning and tweaking to ensure performance and stability stay where they need to. Eventually you get to the point where you miss the days of maintaining a small database, because life was easier then. Indexes were created quickly, joins were fast,
count(*) didn’t bring your database to a screeching halt, and vacuum was not a regular part of your lunchtime conversation. As you continue to tweak and optimize the system, you know you need a plan for the future and know how you’re going to continue to scale.
With Hyperscale (Citus) on Azure Database for PostgreSQL, we help many of those worries fade away. I am super excited to announce that Citus is now available on Microsoft Azure, as a new built-in deployment option on the Azure Database for PostgreSQL called Hyperscale (Citus).
Hyperscale (Citus) scales out your data across multiple physical nodes, with the underlying data being sharded into much smaller bits. The same database sharding principles that work for Facebook and Google are baked right into the database. But, unlike traditional sharded systems, your application doesn’t have to learn how to shard the data. With Azure Database for PostgreSQL, Hyperscale (Citus) takes Postgres, the open source relational database, and extends it with low level internal hooks.
This means you can go back to building new features and functionality, without having to deal with a massive database that is a pain to maintain. When you provision a Hyperscale (Citus) server group, you’ll have a coordinator node which is responsible for distributed planning, query routing, aggregation, as well as a number of worker nodes which you specify. The database cluster will come already setup with the Citus extension and preconfigured so you can begin adding data right away. Let’s dig in and get hands on to see how we can begin scaling and quit worrying about our database with Hyperscale (Citus).
Now with my new Hyperscale (Citus) server group (what I often call a database cluster), I can connect directly to a coordinator. My Hyperscale (Citus) coordinator is responsible for distributed query planning, routing, and aggregation at times. You can think of it mostly as air traffic control, directing the planes which are doing all the heavy lifting. Once I connect to my Hyperscale (Citus) cluster, I can create some standard Postgres tables. In this case I’m going to use some data you may be well used to working with from the other side: GitHub. GitHub thankfully makes much of their data publicly available.
CREATE TABLE github_events ( event_id bigint, event_type text, event_public boolean, repo_id bigint, payload jsonb, repo jsonb, user_id bigint, org jsonb, created_at timestamp ); CREATE TABLE github_users ( user_id bigint, url text, login text, avatar_url text, gravatar_id text, display_login text );
With Hyperscale (Citus), our cluster has some new capabilities that allow us to shard our data. By default when we run the
create_distributed_table command to distribute our data it will create 32 shards (this is configurable) and spread them across the nodes in your Postgres cluster:
SELECT create_distributed_table('github_events', 'user_id'); SELECT create_distributed_table('github_users', 'user_id');
Because under the covers each of my tables is sharded into something that looks roughly like:
github_events_003, etc. each table is significantly smaller. This means all those operations before that were intensive and stress inducing on a large database are worry free.
Now let’s load up some data and run some queries. To start with we’re going to use the bulk load utility
copy to load our data:
\copy github_events from events.csv CSV; \copy github_users from users.csv CSV;
We just loaded a little over 1,000,000 events in right at 30 seconds. For comparison sake, I did the same on a single node Postgres setup, went and made myself some coffee, came back, drank my coffee, and waited for my load to finish. In reality my load on a single node server finished in 4 minutes, but 8x improvement is nice–especially knowing I can add more nodes to make things faster. This parallelization thing is kinda fun. Querying my data is the same as it always has been, execute standard SQL. Except now Hyperscale (Citus) takes care of parallelizing my queries, managing distributed transactions, and more.
select count(*) from github_events; count --------- 1009960 (1 row) Time: 29.848 ms
In addition to parallelizing the load Hyperscale (Citus) can parallelize queries, index creation, autovacuum… in general just about all the operations that tend to keep you awake a night when it comes to taking care of your database.
If you’re curious to try Hyperscale (Citus) on Azure Database for PostgreSQL, just create your Azure account and give some of our tutorials a try, starting with the quickstart docs for Hyperscale (Citus).
Update: Some good news: since I first wrote this post, Hyperscale (Citus) is now generally available as part of our managed PostgreSQL database service on Azure. If you want to learn more:
Here is the definitive GA post I wrote about Hyperscale (Citus) on Azure Database for PostgreSQL.
Here is Claire’s post on architecting an analytics workflow by scaling out Postgres on Azure. All about how the Windows team supports 100s of active users every day (thousands each month) to make shiproom decisions, based on analysis of 20,000 diagnostic metrics from over 800 million devices.