Multiple Node Cluster

CitusDB's single and multiple node clusters execute the exact same logic. They both run independent master and worker databases that communicate over TCP/IP; they primarily differ in that the single node cluster is bound by one machine's hardware resources. To overcome this limitation, CitusDB needs to be installed using a distributed setup.

This distributed setup involves two steps that differ from those of a single node cluster. First, you need to configure authentication settings on the nodes to allow them to talk to each other. Second, you need to either manually log in to all the nodes and issue commands, or rely on tools such as pssh to run the remote commands in parallel for you.

In the following, we assume that you will manually log in to all the nodes and issue commands. We also assume that we have one master and two worker nodes in our distributed setup, and refer to these nodes as master-100, worker-101, and worker-102. We additionally use the notation all-nodes when referring to commands that need to be repeated across all nodes in the cluster. Lastly, we note that you can edit /etc/hosts if your nodes don't already have DNS names assigned.
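For example, if master-100 has the private address 10.0.1.100 and the workers have 10.0.1.101 and 10.0.1.102 (these addresses are illustrative; substitute your own), the corresponding /etc/hosts entries would look as follows:

all-nodes# emacs -nw /etc/hosts

# Illustrative addresses only; replace with your nodes' actual private IPs
10.0.1.100    master-100
10.0.1.101    worker-101
10.0.1.102    worker-102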

Linux Nodes

As with the single node setup, we start by downloading and installing the CitusDB package. For example, if you are running Ubuntu 10.04+ on a 64-bit machine, you need to run the following commands on all the nodes in the cluster:

all-nodes# wget http://packages.citusdata.com/readline-6.0/citusdb-3.0.1-1.amd64.deb
all-nodes# sudo dpkg --install citusdb-3.0.1-1.amd64.deb

We now need to configure connection and authentication settings for the installed databases. For this, you first instruct CitusDB to listen on all IP interfaces, and then configure the client authentication file to allow all incoming connections from the local network. Admittedly, these settings are too permissive for certain types of environments, and the PostgreSQL manual explains in more detail how to restrict them further.

all-nodes# emacs -nw /opt/citusdb/3.0/data/postgresql.conf

# Uncomment listen_addresses for the changes to take effect
listen_addresses = '*'

all-nodes# emacs -nw /opt/citusdb/3.0/data/pg_hba.conf

# Allow unrestricted access to nodes in the local network. The following ranges
# correspond to 24, 20, and 16-bit blocks in Private IPv4 address spaces.
host    all             all             10.0.0.0/8            trust
host    all             all             172.16.0.0/12         trust
host    all             all             192.168.0.0/16        trust

After changing connection settings, you simply need to tell the master node about the workers by editing the master node's membership file. For this, you append each worker node's DNS name to the file. Please note that the name in the membership file must exactly match the worker node's resolved DNS name.

master-100# emacs -nw /opt/citusdb/3.0/data/pg_worker_list.conf

# HOSTNAME     [PORT]     [RACK]
worker-101
worker-102

You are now ready to start up all databases in the cluster.

all-nodes# /opt/citusdb/3.0/bin/pg_ctl -D /opt/citusdb/3.0/data -l logfile start

After all databases start up, you can then connect to the master node and issue queries against the distributed cluster. We cover two such examples in the previous section. These examples are also applicable here; one point to highlight relates to data staging. CitusDB currently requires that you log in to one of the worker nodes to stage data, and then connect to the master node from that worker node using psql.
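For example, assuming that you already created a distributed table named customer_reviews on the master node, you would run the following from one of the worker nodes to stage a data file to the cluster:

worker-101# /opt/citusdb/3.0/bin/psql -h master-100 -d postgres
postgres=# \STAGE customer_reviews FROM '/home/user/customer_reviews_1998.csv' (FORMAT CSV)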


EC2 Multiple Node Cluster

CitusDB also has a publicly available machine image on EC2. As with other Amazon images, you can launch instances of this image through the AWS Management Console. Once you sign in to this console, you will need to choose ami-975043fe as your machine image, and start up several instances based on this image. In our examples, we typically use three m2.2xlarge instances, but you're also welcome to try other configurations.

Besides the management console, you can also start up instances using the Amazon API tools. This approach requires more effort, but we still give instructions for launching instances this way for users who prefer to work entirely from the shell. For those users, we assume that you already have the API tools installed and your security credentials generated. If not, Ubuntu's EC2 Starters Guide provides great instructions on how to get started.

localhost# ec2-authorize default -p 22
localhost# ec2-run-instances ami-975043fe -n 3 -t m2.2xlarge -g default -k <ec2_keypair>
localhost# ec2-describe-instances

These commands first authorize ssh login access for the default security group, and then launch three Citus DB nodes. When starting up, these nodes mount one ephemeral disk at the /opt/citusdb mount point, and then install the database at /opt/citusdb/3.0/data. The nodes also have their postgresql.conf and pg_hba.conf edited so that the databases can talk to each other over TCP.

Once these instances come up, you can pick one node as the master, and log in to it. You then need to edit the membership file on this node, and tell it about the worker nodes in the cluster. For this, you append the other instances' private DNS names to the membership file. From here on, we refer to this node as master-100.

localhost# ssh -i <private SSH key file> ec2-user@<ec2 master hostname>
master-100# emacs -nw /opt/citusdb/3.0/data/pg_worker_list.conf

# HOSTNAME     [PORT]     [RACK]
ip-worker-101.ec2.internal
ip-worker-102.ec2.internal

You are now ready to start up all databases in the cluster. You can either log in to these instances and issue commands to start up databases, or you can use tools such as pssh to run these commands in parallel for you.

all-nodes# /opt/citusdb/3.0/bin/pg_ctl -D /opt/citusdb/3.0/data -l logfile start
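For example, with pssh installed on your local machine, you could start all the databases in parallel from there. Here we assume a file named cluster-hosts.txt that lists each instance's public hostname on its own line (the filename is illustrative), and that your SSH key is loaded:

localhost# pssh -h cluster-hosts.txt -l ec2-user -i '/opt/citusdb/3.0/bin/pg_ctl -D /opt/citusdb/3.0/data -l logfile start'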

After all databases start up, you can then connect to the master node and issue queries against the distributed cluster. We cover two such examples in the previous section. These examples are also applicable here; one point to highlight relates to data staging. CitusDB currently requires that you log in to one of the worker nodes to stage data, and then connect to the master node from that worker node using psql.

For example, assuming that you already created a distributed table on the master node, you'd run the following command to stage data to the cluster from one of the worker nodes:

worker-101# /opt/citusdb/3.0/bin/psql -h master-100 -d postgres
postgres=# \STAGE customer_reviews FROM '/home/user/customer_reviews_1998.csv' (FORMAT CSV)