A multi-tenant sharding tutorial

Craig Kerstiens Mar 9, 2017

About Citus

Citus is a distributed database that scales out PostgreSQL. An extension to Postgres, Citus is available as open source, as on-prem software, and as a fully-managed database as a service.

Sign up for our newsletter

Enjoy what you're reading? Sign-up to our newsletter to stay informed:

Other Recent Posts

Migrating from single-node Postgres to Citus How Citus works (a look at dynamic executors) Citus 7: Transactions, Framework Integration, and Postgres 10 More Articles

Like our blog, or have a question about Citus? Join us on Slack for a chat :)

A number of SaaS applications have data models where they want to have their customers interact with only their data. At the enterprise end you have companies like Salesforce and Workday that fall into this bucket, but we see a ton of small ones as well. If you’re just getting started figuring out how you should approach your data so it can scale in the future, it doesn’t have to be hard.

Here we’re going to walk through an example data model that you can use as a basis for learning how you could apply the same to your own multi-tenant application.

First the schema. For this application we’re going to assume it’s an ad network. Here you’ll have multiple advertisers come on board and each will have their own data around ads, impressions, etc.

CREATE TABLE companies (
    id bigint NOT NULL,
    name text NOT NULL,
    image_url text,
    created_at timestamp without time zone NOT NULL,
    updated_at timestamp without time zone NOT NULL

CREATE TABLE campaigns (
    id bigint NOT NULL,
    company_id bigint NOT NULL,
    name text NOT NULL,
    cost_model text NOT NULL,
    state text NOT NULL,
    monthly_budget bigint,
    blacklisted_site_urls text[],
    created_at timestamp without time zone NOT NULL,
    updated_at timestamp without time zone NOT NULL

    id bigint NOT NULL,
    company_id bigint NOT NULL,
    campaign_id bigint NOT NULL,
    name text NOT NULL,
    image_url text,
    target_url text,
    impressions_count bigint DEFAULT 0,
    clicks_count bigint DEFAULT 0,
    created_at timestamp without time zone NOT NULL,
    updated_at timestamp without time zone NOT NULL

Now we’re going to add a few primary keys just as we would for any other application:

ALTER TABLE campaigns ADD PRIMARY KEY (id, company_id);
ALTER TABLE ads ADD PRIMARY KEY (id, company_id);

You’ll notice the primary keys we have are on campaigns and ads, this later will allow us to have full foreign key constraints to ensure all the company data will be located together. You can read more about how co-location makes this possible in this post.

And finally the part that makes it all scale horizontally. We’re going to set the tables up as sharded within Citus. To do this, we’re going to run one function for each table we have. This function will tell the database to shard the tables based on the key we specify. In this case we’re going to use the company_id across all tables so they get located together:

SELECT create_distributed_table('companies', 'id');
SELECT create_distributed_table('campaigns', 'company_id');
SELECT create_distributed_table('ads', 'company_id');

With our schema all setup lets grab some sample data and load it up:

curl https://examples.citusdata.com/tutorial/companies.csv > companies.csv
curl https://examples.citusdata.com/tutorial/campaigns.csv > campaigns.csv
curl https://examples.citusdata.com/tutorial/ads.csv > ads.csv

Then from within your Citus cluster you can \copy in the data:

\copy companies from 'companies.csv' with csv;
\copy campaigns from 'campaigns.csv' with csv;
\copy ads from 'ads.csv' with csv;

Now you’ve got your sharded database all setup, you can start running some queries against it. Let’s find the monthly ad budget for a particular company:

SELECT name, cost_model, state, monthly_budget
FROM campaigns
WHERE company_id = 5
ORDER BY monthly_budget DESC

You can also join across other tables we’ve sharded:

SELECT campaigns.id, campaigns.name, campaigns.monthly_budget,
       sum(impressions_count) as total_impressions, sum(clicks_count) as total_clicks
FROM ads, campaigns
WHERE ads.company_id = campaigns.company_id
AND campaigns.company_id = 5
AND campaigns.state = 'running'
GROUP BY campaigns.id, campaigns.name, campaigns.monthly_budget
ORDER BY total_impressions, total_clicks;

The key for queries, is to ensure when you’re querying and joining that you have the tenant id, in this case the company_id explicitly expressed on each query.

If you’re curious to read a bit more on how to migrate your own application give our docs a read or consider taking look at our gem activerecord-multi-tenant which simplifies much of this for you/

← Next article Previous article →