Citus Data Blog

Thoughts on scaling out PostgreSQL, big data architectures, distributed systems, and the PostgreSQL community.

Dynamo, Citus, and Tradeoffs in Distributed Databases

Note: This is a guest blog post by Giuseppe “Pino” de Candia, the creator of Dynamo. We asked Pino to chime in with his thoughts on distributed databases and the trends he sees in this space. You can read more about Pino here.

When Ozgun, one of the founders of Citus Data, emailed me resources on scaling multi-tenant databases for B2B apps and asked me what I thought, all kinds of distributed systems tradeoffs started crossing my mind—along with memories of the forces that shaped Dynamo.

It’s been a decade since my team at Amazon worked on Dynamo, a highly available and scalable key-value store. By the time we started working on the project, Amazon was already going through two transitions.

Keep reading
Giuseppe Pino de Candia May 11, 2017

Scaling Connections in Postgres

There are a number of applications out there that have a high number of connections to Postgres. What’s high? That all depends on your application, but generally when you get to the few hundred connection area in Postgres you’re in the higher end. Anything in the thousands is definitely in the high territory, and even several hundred can put strain on your application. Generally a safe level for connections should be somewhere around 300-500 connections. This may seem low if you’re already running with thousands of connections, but it’s likely perfectly fine with pgBouncer taking care of the heavy lifting for you. Let’s drill into why a bit further.

Keep reading
Craig Kerstiens May 10, 2017

Postgres tips for Rails developers

This week at RailsConf, we found ourselves sharing a lot of tips for using PostgreSQL with Rails. We thought it might be worthwhile to write up many of these and share more broadly. Here you’ll find some tips that will help you in debugging and improving performance of your database from your Rails app.

And now, on to the code.

Keep reading
Lukas Fittl Apr 28, 2017

Analyzing PostgreSQL Email Archives with PostgreSQL

Note: This post was originally published in January, 2013. It has since been updated to have the appropriate commands you can use with Citus 6.1 which is open source and available to try via download or on Citus Cloud starting at $3 a day.

PostgreSQL’s Full Text Search capability is an excellent example of a powerful, relatively new feature that Citus users are able to leverage. First introduced in PostgreSQL 8.3 and continuously improved since then, Full Text Search (FTS) provides the SQL semantics necessary to run keyword searches over a corpus of documents stored in a database. FTS executes these operations efficiently by enabling the use of GIN and GiST indexes . In addition, from our standpoint, one of the best features of FTS is that Citus makes it linearly scalable. Specifically, with Citus one can double the size of the search dataset and still satisfy the same SLAs simply by doubling the number of machines in the Citus cluster.

In this blog post we wanted to demonstrate using Full Text Search with Citus, but in order to do so we first needed to find an interesting dataset to use in our examples. After hunting around we eventually hit on the idea of using email archives from the PostgreSQL mailing lists. Besides having a nicely self-reflexive quality about it, we also thought this would provide a good opportunity to learn more about the PostgreSQL community. Early in the history of PostgreSQL project, these lists became the primary communication mechanism for both developers and users, and the collected archives in turn provide a unique opportunity to get a complete view of almost everything that has happened in the project over the past fifteen years.

Keep reading
Carl Steinbach Apr 20, 2017

Dynamically resizing your Postgres cluster with Citus' shard rebalancer

One of the most common questions we get at Citus is how the rebalancer works–which is understandable. When you have an elastic scaled out database, how easy it is to scale is a key factor into how usable it will be. While we’ll happily take time for anyone that’s interested and live demo it, this walk through should give you a better idea of how it works for those of you that are curious.

Keep reading
Craig Kerstiens Apr 11, 2017

Distributed count(distinct) with HyperLogLog on Postgres

Running SELECT COUNT(DISTINCT) on your database is all too common. In applications it’s typical to have some analytics dashboard highlighting the number of unique items such as unique users, unique products, unique visits. While traditional SELECT COUNT(DISTINCT) queries works well in single machine setups, it is a difficult problem to solve in distributed systems. When you have this type of query, you can’t just push query to the workers and add up results, because most likely there will be overlapping records in different workers. Instead you can do:

  • Pull all distinct data to one machine and count there. (Doesn’t scale)
  • Do a map/reduce. (Scales but it’s very slow)

This is where approximation algorithms or sketches come in. Sketches are probabilistic algorithms which can generate approximate results efficiently within mathematically provable error bounds. There are a many of them out there, but today we’re just going to focus on one, HyperLogLog or HLL. HLL is very successfull for estimating unique number of elements in a list. First we’ll look some at the internals of the HLL to help us understand why HLL algorithm is useful to solve distict count problem in a scalable way, then how it can be applied in a distributed fashion. Then we will see some examples of HLL usage.

Keep reading
Burak Yucesoy Apr 4, 2017

How we implement Disaster Recovery and High Availability with Postgres on Citus Cloud

AWS is the leader when it comes to the cloud, and for good reason. AWS is well ahead in the quality and breadth of services they offer.

However, when a service is running at the scale of AWS, it is natural to expect some failures to occur. According to AWS EBS availability is designed for 99.999%.

The annual failure rate (AFR) is 0.1% - 0.2%, where failure means a complete or partial failure. For example, if you had 1,000 EBS discs, you should expect 1 or 2 to have a failure per year. In our experience, partial failure is significantly more common than a complete loss. Even so, a partial loss can take a lot of time to resolve and can still be debilitating to a business.

Over the years, there have been some AWS failures that made news headlines due to havoc caused for both companies and their users. These incidents put a spotlight on AWS’ imperfections.

Keep reading
Daniel Farina Mar 23, 2017

A Look at Isolating Tenants To Improve Database Performance

For many SaaS products, a common database problem is having one customer that has so much data, it adversely impacts other customers on the shared machine. This leads many to ask, “What do I do with my largest customer?”

Tenant isolation is a great way to solve this issue. Effectively it allows you to control which tenant or customer in particular you want to isolate on a completely new node. By separating a tenant, you get dedicated resources with more memory and cpu processing power.

Keep reading
Metin Doslu Mar 15, 2017

How to Scale PostgreSQL on AWS–Learnings from Citus Cloud

Citus is a distributed database that extends (not forks) PostgreSQL for large workloads. One challenge associated with building a distributed relational database (RDBMS) is that they require notable effort to deploy and operate. To remove these operational barriers, we’ve been thinking about offering Citus as a managed database for a while now.

Naturally, we were also worried that providing a native database offering on AWS could split our startup’s focus and take up significant engineering resources. (Honestly, if the founding engineers of the Heroku Postgres team didn’t join Citus, we might have decided to wait on this.) After having Citus Cloud publicly available for eight months though, we are now more bullish on the cloud then ever.

It turns out that targeting an important use case for your customers and delivering it to them in a way that removes their pain points, matters more than anything else. In this blog post, we’ll only focus on removing operational pain points and not on use cases: Why is cloud changing the way databases are delivered to customers? What AWS technologies Citus Cloud is using to enable that in a unique way?

Keep reading
Ozgun Erdogan Mar 10, 2017

A multi-tenant sharding tutorial

A number of SaaS applications have data models where they want to have their customers interact with only their data. At the enterprise end you have companies like Salesforce and Workday that fall into this bucket, but we see a ton of small ones as well. If you’re just getting started figuring out how you should approach your data so it can scale in the future, it doesn’t have to be hard.

Here we’re going to walk through an example data model that you can use as a basis for learning how you could apply the same to your own multi-tenant application.

Keep reading
Craig Kerstiens Mar 9, 2017

Page 1 of 11

Next page