Migrating from single-node Postgres to Citus

Many things are everyday occurrences for engineering teams: deploying new code, deploying a new service. It’s even fairly common to deploy a net new data store or language. But migrating from one database to another is far rarer. While migrating your database can seem like a daunting task, there are lessons you can learn from others, and steps you can take to minimize the risk of the move.

At Citus Data, we’ve helped many customers migrate from single-node Postgres, like RDS or Heroku Postgres, to a distributed Citus database cluster, so they can scale out and take advantage of the compute, memory, and disk resources of a distributed system. Along the way we’ve been privy to some valuable lessons learned, and we’ve developed some best practices. Here is a guide to the steps to follow as you start to create your migration plan to Citus.

Craig Kerstiens Sep 20, 2017

How Citus works (a look at dynamic executors)

In the beginning there was Postgres

We love Postgres at Citus. And rather than create a newfangled database from scratch, we implemented Citus as an extension to Postgres. We’ve talked a lot on our blog about how you can leverage Citus, about key use cases, and about different data models and sharding approaches. But we haven’t spent a lot of time explaining how Citus works. So if you want to dive deeper, here we’re going to walk through everything from how Citus shards the data to how the executors run queries.

Distributing data within Citus

Citus gets its benefits from sharding your data, which allows us to split the data across multiple physical nodes. When your tables are significantly smaller due to sharding, your indexes are smaller, vacuum runs faster, and everything works like it did when your database was smaller and easier to manage.
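To make the idea of splitting data across nodes concrete, here is a minimal Python sketch of hash-based sharding. It is an illustration under stated assumptions, not a description of how Citus itself assigns shards: the shard count, the node names, the SHA-256 hash, and the round-robin placement are all hypothetical choices made for the example.

```python
import hashlib
from typing import Sequence, Tuple

# Hypothetical values chosen for illustration only.
SHARD_COUNT = 32
NODES = ("worker-1", "worker-2", "worker-3", "worker-4")


def shard_for(key: str, shard_count: int = SHARD_COUNT) -> int:
    """Map a distribution key to a shard id using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % shard_count


def node_for(shard_id: int, nodes: Sequence[str] = NODES) -> str:
    """Place shards on nodes round-robin (an assumed placement policy)."""
    return nodes[shard_id % len(nodes)]


def route(key: str) -> Tuple[int, str]:
    """Return the (shard id, node) pair that would hold rows for this key."""
    shard_id = shard_for(key)
    return shard_id, node_for(shard_id)
```

Because the hash is stable, every row with the same key lands on the same shard, so each node only ever stores and indexes its own slice of the table.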

Craig Kerstiens Sep 15, 2017

Citus 7: Transactions, Framework Integration, and Postgres 10

“Thirty years ago, my older brother was trying to get a report on birds written that he’d had three months to write. It was due the next day.

We were out at our family cabin in Bolinas, and he was at the kitchen table close to tears, surrounded by binder paper and pencils and unopened books on birds, immobilized by the hugeness of the task ahead. Then my father sat down beside him, put his arm around my brother’s shoulder, and said, ‘Bird by bird, buddy. Just take it bird by bird.’”

Bird by Bird: Some Instructions on Writing and Life, by Anne Lamott

When we started working on Citus, our vision was to combine the power of relational databases with the elastic scale of NoSQL. To do this, we took a different approach. Instead of building a new database from scratch, we leveraged PostgreSQL’s new extension APIs. This way, Citus would make Postgres a distributed database and integrate with the rich ecosystem of tools you already use.

When PostgreSQL is involved, executing on this vision isn’t a simple task. The PostgreSQL manual offers 3,558 pages of features built over two decades. The tools built around Postgres use and combine these features in unimaginable ways.

After our Citus open source announcement, we talked to many of you about scaling out your relational database. In every conversation, we’d hear about different Postgres features that needed to scale out of the box. We’d take notes from our meetings and add these features into an internal document. The list would keep getting longer, and longer, and longer.

Like the child writing a report on birds, the task ahead felt insurmountable. So how do you take a solid relational database and make sure that all those complex features scale? You take it bird by bird. We broke down the problem of scaling into five hundred smaller ones and started implementing these features one by one.

Ozgun Erdogan Sep 7, 2017

Databases and Distributed Deadlocks: A FAQ

Since Citus is a distributed database, we often hear questions about distributed transactions. Specifically, people ask us about transactions that modify data living on different machines.

So we started to work on distributed transactions. We then identified distributed deadlock detection as the building block to enable distributed transactions in Citus.

First some background: At Citus we focus on scaling out Postgres. We want to make Postgres performance & Postgres scale something you never have to worry about. We even have a cloud offering, a fully-managed database as a service, to make Citus even more worry-free. We carry the pager so you don’t have to and all that. And because we’ve built Citus using the PostgreSQL extension APIs, Citus stays in sync with all the latest Postgres innovations as they are released (aka we are not a fork.) Yes, we’re excited for Postgres 10 like all the rest of you :)

Back to distributed deadlocks: As we began working on distributed deadlock detection, we realized that we needed to clarify certain concepts. So we created a simple FAQ for the Citus development team. And we found ourselves referring back to the FAQ over and over again. So we decided to share it here on our blog, in the hopes you find it useful.
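As a rough illustration of why deadlocks across machines need special handling, the Python sketch below uses a common textbook technique: merge the "who waits for whom" edges reported by each node into one wait-for graph and look for a cycle. This is an assumption made for the example, not necessarily how Citus implements detection; the function names and transaction ids are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

# (waiting transaction, transaction holding the lock)
Edge = Tuple[str, str]


def merge_wait_graphs(per_node_edges: Iterable[Iterable[Edge]]) -> Dict[str, Set[str]]:
    """Combine the wait-for edges reported by each node into one graph."""
    graph: Dict[str, Set[str]] = {}
    for edges in per_node_edges:
        for waiter, holder in edges:
            graph.setdefault(waiter, set()).add(holder)
            graph.setdefault(holder, set())
    return graph


def find_cycle(graph: Dict[str, Set[str]]) -> Optional[List[str]]:
    """Return one cycle of transaction ids, or None if there is no deadlock."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {txn: WHITE for txn in graph}
    path: List[str] = []

    def visit(txn: str) -> Optional[List[str]]:
        color[txn] = GREY
        path.append(txn)
        for nxt in sorted(graph[txn]):
            if color[nxt] == GREY:
                # Reached a transaction already on the path: a cycle.
                return path[path.index(nxt):]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found is not None:
                    return found
        path.pop()
        color[txn] = BLACK
        return None

    for txn in sorted(graph):
        if color[txn] == WHITE:
            found = visit(txn)
            if found is not None:
                return found
    return None


# On node A, tx1 waits for tx2; on node B, tx2 waits for tx1.
node_a = [("tx1", "tx2")]
node_b = [("tx2", "tx1")]
print(find_cycle(merge_wait_graphs([node_a])))          # no cycle on node A alone
print(find_cycle(merge_wait_graphs([node_a, node_b])))  # cycle appears once merged
```

Neither node sees a cycle on its own, which is why a purely local deadlock detector cannot catch this case. The cycle only shows up once the edges from both nodes are combined.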

Marco Slot Aug 31, 2017

Five sharding data models and which is right

When it comes to scaling your database, there are challenges but the good news is that you have options. The easiest option of course is to scale up your hardware. And when you hit the ceiling on scaling up, you have a few more choices: sharding, deleting swaths of data that you think you might not need in the future, or trying to shrink the problem with microservices.

Deleting portions of your data is simple, if you can afford to do it. When it comes to sharding, there are a number of approaches, and which one is right depends on several factors. Here we’ll survey five sharding approaches and dig into the factors that guide you to each.

Craig Kerstiens Aug 28, 2017

Introducing WAL-G by Citus: Faster Disaster Recovery for Postgres

A key part of running a reliable database service is ensuring you have a good plan for disaster recovery. Disaster recovery comes into play when disks or instances fail, and you need to be able to recover your data. In those types of cases, logical backups taken via pg_dump may be days old and so not ideal to restore from. To remove the risk of data loss, many of us turn to the Postgres WAL to keep safe.

Years ago Daniel Farina, now a principal engineer at Citus Data, authored a continuous archiving utility to make it easy for Postgres users to prepare for and recover from disasters. The tool, WAL-E, has been used to keep millions of Postgres databases safe. Today we’re excited to introduce a new version of this tool: WAL-G. WAL-G, the successor to WAL-E, was created by a software engineering intern here at Citus Data, Katie Li, who is an undergraduate at UC Berkeley.

Craig Kerstiens Aug 18, 2017

Principles of Sharding for Relational Databases

When your database is small (10s of GB), it’s easy to throw more hardware at the problem and scale up. As your tables grow, however, you need to think about other ways to scale your database.

In one way, sharding is the best way to scale. Sharding enables you to linearly scale your database’s CPU, memory, and disk resources by separating your database into smaller parts. In other ways, sharding is a controversial topic. The internet is full of advice on sharding, from “essential to scaling your database infrastructure” to “why you never want to shard”. So the question is, whose advice should you take?

Ozgun Erdogan Aug 9, 2017

Fork your distributed Postgres database with Citus

Having a database staging environment that is as close to production as possible is key to being able to test your app. This applies to both your code and to your database. Far too often a staging database is a forgotten child in your stack—not getting the same love and attention as your production instance. For some teams, their staging database is years old, or worse yet, their staging database is a 10 GB sample of a 2 TB production database.

What if you could easily have a full staging environment to experiment with, that is an exact copy of your production database? Even if that production database is 50 TB?

As of today on Citus Cloud—our fully-managed database as a service that is built to scale-out (and based on Postgres!)—you can get a full fork of your production database with the click of a button.

Craig Kerstiens Aug 4, 2017

Database Table Types with Citus and Postgres

Citus is Postgres that scales out horizontally. We do this by distributing queries across multiple Postgres servers—and as is often the case with scale-out architectures, this scale-out approach provides some great performance gains. And because Citus is an extension to Postgres, you get all the awesome features in Postgres such as support for JSONB, full-text search, PostGIS, and more.

The distributed nature of Citus gives you new flexibility when it comes to modeling your data. This is good. But you’ll need to think about how to model your data and what type of database tables to use. The way you query your data ultimately determines how you can best model each table. In this post, we’ll dive into the three different types of tables in Citus and how you should think about each.

Craig Kerstiens Jul 27, 2017

Customizing My Postgres Shell

As a developer, your CLI is your home. You spend a lifetime of person-years in your shell, and even small optimizations can pay major dividends in efficiency. If you work with Postgres, you likely work in psql, and it deserves some love and care. A little-known fact is that psql has a number of configuration options, and these can all live in an rc file called .psqlrc in your home directory. Here is my .psqlrc file, which I’ve customized to my liking. Let’s walk through some of the commands within my .psqlrc file:

Craig Kerstiens Jul 16, 2017

