If you want to learn more about Citus on Microsoft Azure, read this post about Hyperscale (Citus) on Azure Database for PostgreSQL.
A number of SaaS applications have data models where they want to have their customers interact with only their data. At the enterprise end you have companies like Salesforce and Workday that fall into this bucket, but we see a ton of small ones as well. If you’re just getting started figuring out how you should approach your data so it can scale in the future, it doesn’t have to be hard.
Here we’re going to walk through an example data model that you can use as a basis for learning how you could apply the same to your own multi-tenant application.
First the schema. For this application we’re going to assume it’s an ad network. Here you’ll have multiple advertisers come on board and each will have their own data around ads, impressions, etc.
CREATE TABLE companies ( id bigint NOT NULL, name text NOT NULL, image_url text, created_at timestamp without time zone NOT NULL, updated_at timestamp without time zone NOT NULL ); CREATE TABLE campaigns ( id bigint NOT NULL, company_id bigint NOT NULL, name text NOT NULL, cost_model text NOT NULL, state text NOT NULL, monthly_budget bigint, blacklisted_site_urls text, created_at timestamp without time zone NOT NULL, updated_at timestamp without time zone NOT NULL ); CREATE TABLE ads ( id bigint NOT NULL, company_id bigint NOT NULL, campaign_id bigint NOT NULL, name text NOT NULL, image_url text, target_url text, impressions_count bigint DEFAULT 0, clicks_count bigint DEFAULT 0, created_at timestamp without time zone NOT NULL, updated_at timestamp without time zone NOT NULL );
Now we’re going to add a few primary keys just as we would for any other application:
ALTER TABLE companies ADD PRIMARY KEY (id); ALTER TABLE campaigns ADD PRIMARY KEY (id, company_id); ALTER TABLE ads ADD PRIMARY KEY (id, company_id);
You’ll notice the primary keys we have are on
ads, this later will allow us to have full foreign key constraints to ensure all the company data will be located together. You can read more about how co-location makes this possible in this post.
And finally the part that makes it all scale horizontally. We’re going to set the tables up as sharded within Citus. To do this, we’re going to run one function for each table we have. This function will tell the database to shard the tables based on the key we specify. In this case we’re going to use the
company_id across all tables so they get located together:
SELECT create_distributed_table('companies', 'id'); SELECT create_distributed_table('campaigns', 'company_id'); SELECT create_distributed_table('ads', 'company_id');
With our schema all setup lets grab some sample data and load it up:
curl https://examples.citusdata.com/tutorial/companies.csv > companies.csv curl https://examples.citusdata.com/tutorial/campaigns.csv > campaigns.csv curl https://examples.citusdata.com/tutorial/ads.csv > ads.csv
Then from within your Citus cluster you can
\copy in the data:
\copy companies from 'companies.csv' with csv; \copy campaigns from 'campaigns.csv' with csv; \copy ads from 'ads.csv' with csv;
Now you’ve got your sharded database all setup, you can start running some queries against it. Let’s find the monthly ad budget for a particular company:
SELECT name, cost_model, state, monthly_budget FROM campaigns WHERE company_id = 5 ORDER BY monthly_budget DESC LIMIT 10;
You can also join across other tables we’ve sharded:
SELECT campaigns.id, campaigns.name, campaigns.monthly_budget, sum(impressions_count) as total_impressions, sum(clicks_count) as total_clicks FROM ads, campaigns WHERE ads.company_id = campaigns.company_id AND campaigns.company_id = 5 AND campaigns.state = 'running' GROUP BY campaigns.id, campaigns.name, campaigns.monthly_budget ORDER BY total_impressions, total_clicks;
The key for queries, is to ensure when you’re querying and joining that you have the tenant id, in this case the
company_id explicitly expressed on each query.