Citus 11 is out! Now 100% open source. Read all about it in Marco’s release blog. 💥
When your database is small (10s of GB), it’s easy to throw more hardware at the problem and scale up. As these tables grows however, you need to think about other ways to scale your database.
In one way, sharding is the best way to scale. Sharding enables you to linearly scale your database’s cpu, memory, and disk resources by separating your database into smaller parts. In other ways, sharding is a controversial topic. The internet is full of advice on sharding, from “essential to scaling your database infrastructure” to “why you never want to shard”. So the question is, whose advice should you take?
We always knew when the topic of sharding came up, the answer was, “it depends.”
The theory of sharding is simple: Pick one key (column) that evenly distributes your data. Make sure that most of your queries can be addressed by that key. This theory is simple, but once you dive into sharding your database, the practice becomes messy.
At Citus, we helped hundreds of teams as they looked into sharding their databases. As we helped them, we saw that some key patterns emerged.
In this blog post, we’ll first look at key properties that impact a sharding project’s success. Then we’ll dig into the underlying reason why opinions on sharding differ from one another. When it comes to sharding a mature database, the type of application you’re building impacts your success more than anything else.
We found that when you think about sharding your database, three key properties impacted your project’s success. The following diagram shows those properties on three axes and also gives well-known company names as examples.
The x-axis in the diagram shows the workload type. This axis starts with transactional workloads on the left and continues onto data warehousing on the right. This dimension is the most recognized one when making scaling decisions.
The z-axis demonstrates another important property: where in your application lifecycle are you? How many tables do you have in your database (10, 100, 1000) or how long has your application been in production? An application that’s been running on PostgreSQL for a few months will have a way easier time with sharding than an app that’s been in production for years.
At Citus, we find that the majority of users who talk to us have fairly developed applications. When you have a mature application, the y-axis in the diagram becomes crucial. Sadly, this dimension isn’t as recognized as the two other ones. In fact, most articles that reach contradicting conclusions about sharding do so because they provide their recommendation within the context of one application type.
The y-axis in the diagram shows the most important property when sharding mature database setups: your application type. In this axis, we have B2B applications at the top, whose data models are more amenable to sharding. At the bottom of this axis, we have B2C applications such as Amazon and Facebook that require more work to shard. In the following, we pick three well-known companies and talk about their differences.
A good example of a B2B application is CRM software. When you’re building a CRM app, such as Salesforce, your application is going to serve other customers. For example, GE Aviation, will be one of your customers using Salesforce.
Within GE Aviation, they have users who log into their company dashboard. GE also captures leads who are people they could do business with, contacts who are people that they know and already have a business relationship with, accounts who represent business entities and who have contacts working for them, and opportunities which are sales events related to an account and one or more contacts. When you map these complex relationships within a database, the mapping looks like the following graph.
This graph looks complex at first. If you spend more time looking into it however, you’ll see that most tables originate from the customer table. You can then transform this graph into the following one by adding a customer_id column to all your tables.
With this simple transformation, your database now has a good sharding key: customer_id. This sharding key usually distributes the data evenly; and most queries to the database will include the customer key. Further, you can colocate your tables on customer_id, and continue to use key relational database features such as transactions, table joins, and foreign key constraints.
In other words, if you have B2B application, the nature of your data gives you a fundamental advantage with sharding.
Amazon.com is a good example of a mature B2C application. If you were building the Amazon.com website today, you have several concepts to think about. First, a user comes to your website and starts looking at you products, such as books or electronics. When the user visits a product page, say Harry Potter 7, they see catalog information related to that product. Example catalog info includes Harry Potter 7’s author, price, cover picture, and other images.
When your user logs into your website, they then start accessing user related data. The user needs to be authenticated, can write reviews on their favorite products, and can add items to their shopping cart. At some point, the user decides to buy items in their shopping cart; and places an order. The order now gets processed, fulfilled from a warehouse, and becomes a shipment.
When you map these relationships within your relational database, you find that they differ from the Salesforce example in one important way. You don’t have a single dimension that’s the center of all your relationships, you have at least three: catalog, user, and order data.
When you look to shard this type of B2C data, one option is to refactor your application into micro-services. For example, you have catalog related services that own catalog and offer data, and user related services that own authentication and shopping cart data. The APIs between services then define the boundaries for accessing the underlying data.
When you create this separation between your data, you can then shard the data that powers each service or group of services separately. In fact, Amazon.com used a similar approach to sharding when they transitioned to a service-oriented architecture.
This approach to sharding has a significantly different benefit to cost ratio than sharding a B2B application. On the benefits side, when you separate your data into groups this way, you can rely on your database to join data from different sources or provide transactions and constraints across data groups. On the costs side, you now need to shard not one, but multiple groups of data.
One sub-category that sits between B2B and B2C include applications such as Postmates, Instacart, or Lyft. For example, Instacart delivers groceries to users from local stores. In one sense, Instacart is similar to the Amazon.com example. Instacart has three dimensions central to their application: local stores that offer groceries, users who order those groceries, and drivers that deliver them. So, it’s hard to pick one key on which to shard their database.
When you have a mature B2C2C application like Instacart, you may follow a different strategy. You may find that most of your database tables share a different dimension: geography. In this case, you may pick the city or geography as your sharding key, and shard your tables on the geography key.
In general, sharding a B2B2C / B2C2C application lives in the middle of the sharding spectrum. Sharding for B2B2C tends to have a higher benefit-cost ratio than sharding B2C apps, and a lower one than B2B apps.
The internet is full of different opinions on sharding. We found that most of these opinions are formed with one application or application type in mind. In fact, your application type (B2B or B2C) impacts your success with sharding more than anything else. In particular, if you have a B2B application, you’re going to have a much easier time sharding your relational database.
Of course, this insight alone doesn’t answer all questions related to sharding. When you’re thinking about scaling your database, you need a complete view that evaluates all options with your project’s requirements in mind. If you’re thinking of scaling or sharding your database, drop us a line and we’d be happy to chat!