POSETTE 2024 is a wrap! 💯 Thanks for joining the fun! Missed it? Watch all 42 talks online 🍿
POSETTE 2024 is a wrap! 💯 Thanks for joining the fun! Missed it? Watch all 42 talks online 🍿
Written by Craig Kerstiens
May 22, 2018
Update in July 2023: Citus 12 allows you to shard your database by schema, or if you prefer, you can still use row-based sharding. Find out more in the Citus 12 blog post.
We spend a lot of time with companies that are growing fast, or planning for future growth. It may be you've built your product and are now just trying to keep the system growing and scaling to handle new users and revenue. Or you may be still building the product, but know that an even moderate level of success could lead to a lot of scaling. In either case where you spend your time is key in order to not lose valuable time.
As Donald Knuth states it in Computer Programming as an Art:
"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
With the above in mind one of the most common questions we get is: What do I need to do now to make sure I can scale my multi-tenant application later?
We've written some before about approaches not to take such as schema based sharding or one database per customer and the trade-offs that come with that approach. Here we'll dig into three key steps you should take that won't be wasted effort should the need to scale occur.
In the world of databases you can learn all about normalization. Normalization has various forms but in simple terms is not duplicating data, rather each unique piece of data has a key to identify it and you reference that key that has the association to that data. In the reporting world denormalization can be extremely common, allowing you to generate reports faster. For multi-tenant applications going somewhere middle of the road is ideal, the best step you can take early is ensuring you have your tenant_id
on every single table. Suppose you a basic CRM schema:
CREATE TABLE leads (
id serial primary key,
first_name text,
last_name text,
email text
);
CREATE TABLE accounts (
id serial primary key,
name text,
state varchar(2),
size int
);
CREATE TABLE opportunity (
id serial primary key,
name text,
amount int
);
To plan to scale later a key step would be to add the tenant_id
onto every table, in this case it'll be org_id
:
CREATE TABLE leads (
id serial primary key,
first_name text,
last_name text,
email text,
org_id int
);
CREATE TABLE accounts (
id serial primary key,
name text,
state varchar(2),
size int,
org_id int
);
CREATE TABLE opportunity (
id serial primary key,
name text,
amount int,
org_id int
);
In addition to adding the tenant_id
it's a good practice to then use it as part of your primary and foreign keys. Good thing for you composite keys work great in Postgres. Instead of putting your primary key then on a single column you specify it at the end of the table creation:
CREATE TABLE leads (
id serial,
first_name text,
last_name text,
email text,
org_id int,
primary key (org_id, id)
);
By denormalizing a little and updating your keys to be composite ones you're in pretty good shape to scale out when the time comes. But there is one more bit of work you can put in early that won't be wasted.
A single connection to a Postgres database consumes roughly 10 MB of overhead. Most web application frameworks such as Rails and Django keep a pool of connections or persist them so when they have a new request there is less time spent connecting to the database. This reduction in time comes at the cost of using more resources from your database. When you're early this may be an okay trade-off, but if you ever need to scale this will become a bottleneck.
The ideal is to setup proper connection management via something like a third party connection pooler such as pgbouncer. Even if you're not ready to put in pgbouncer today you should be monitoring both your active and idle connections to make sure this doesn't sneak up on you.
The best part on each of the above steps is that by doing them early, the effort is minimal. To refactor later, the time it will take becomes proportional to how large and complex your app has grown. The best technical investments are ones that can be made gradually with minimal effort, but still provide compounding gains over the long term. If you have questions on planning for scale or are already running into scaling issues with your Postgres database, feel free to reach out to our database team here at Citus to see if we can help.