General Questions

What is Citus Data about?

CitusDB is an analytics database that modifies and extends PostgreSQL for scalability. Users talk to CitusDB's master node as they would to a regular database, and the master node partitions the data and queries across the worker nodes in the cluster. The specifics of the underlying architecture closely resemble those of Hadoop.

In other words, CitusDB combines the SQL expressiveness and performance of relational databases with the scalability and availability of Hadoop, in a single, uniform product.

When should I use CitusDB?

  • You want flexible access to your historic data (user actions, event streams, text logs, machine generated data) through SQL.
  • Parts of your data are growing at a pace that prohibits putting them into an expensive data warehouse.
  • You want to be able to scale your analytics, regardless of your data volume.
  • You want your customers to access their historic data with real-time responsiveness: complex aggregations over very large data sets answered within seconds, and key-value lookups in under a second.

When should I not use CitusDB?

CitusDB scales out both longer-running analytics queries and short requests such as real-time lookups/inserts/updates, but it does not provide the full traditional transactional semantics that Postgres does. That makes it a great fit for ad-hoc, real-time analytics on timeseries data, but not ideal as the transactional backbone of your credit card processing system.

How fast is CitusDB?

CitusDB outperforms purpose-built analytics appliances by more than 10x. The graph below uses the industry-standard TPC-H benchmark and compares the performance of a CitusDB cluster running on 100 EC2 instances to that of a dedicated analytics appliance.

What use-cases and data sets are best suited to CitusDB?

CitusDB is optimized for performing ad-hoc analysis, standard reporting, and data exploration on your append-only event data, but not for modifying this data in real-time.

Data that has a natural temporal ordering (e.g. user actions, event data, text-based log files, machine generated data, clickstreams, ad impressions) and that grows rapidly is particularly well suited to CitusDB.

With this kind of data, you can ask questions like:

  • Who are my most valuable/engaged customers, based on the activities they perform on my site?
  • Which group of users clicks on a given category of ads most often?
  • As an advertiser, what are all the sites I’ve published this specific ad on?
  • Is my CampaignA traffic converting better than my CampaignB traffic?
  • How many users in each age group used our newly released feature last month?
  • What are all the IP addresses involved with this specific event between these two dates?

Your queries can involve a particular range, join multiple tables together, filter based on complex selection criteria, group and sort results, perform aggregations, and execute other standard analytic functions.
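
As an illustration, here is a sketch of one such query against a hypothetical schema (an events table distributed by customer_id plus a customers table with demographic attributes); the table and column names are examples only:

    -- How many users in each age group used our newly released feature last month?
    SELECT c.age_group, count(DISTINCT e.customer_id) AS active_users
    FROM events e
    JOIN customers c ON e.customer_id = c.customer_id
    WHERE e.event_type = 'new_feature_used'
      AND e.created_at >= date '2014-05-01'
      AND e.created_at <  date '2014-06-01'
    GROUP BY c.age_group
    ORDER BY active_users DESC;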

How is CitusDB different than Hadoop?

CitusDB stores data in an extended PostgreSQL database, and therefore provides the SQL expressiveness and core performance benefits of databases (indexes, join optimizations, etc.) that are not available in Hadoop.

The database also enables real-time responsiveness. Simple queries can take as little as 100ms, and complex aggregations over large data sets complete within seconds.

How is CitusDB different than other analytics databases?

CitusDB is built from the start with true parallelism in mind and can efficiently scale to hundreds of nodes. Its software-only architecture allows it to run anywhere, on-premise or in the cloud, without any specific expectations from the hardware.

We know the devil is not in the claims we make, but in the details. Watch our video to see CitusDB in action, and then get started with our sample data sets or your own data.

Is CitusDB able to scale to multiple cores of parallel query execution on the same node, or does that require multiple nodes?

A single CitusDB node stores multiple shards of the same distributed table. This enables CitusDB to use multiple cores for a single query by virtue of hitting multiple PostgreSQL tables (shards) on each node. However, to get true scalability in performance and reliability, we recommend a multi-node cluster. In cases where queries hit the disk, a single node setup can easily become disk I/O bound.
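
As a rough sketch of how shards map to nodes, the statements below create a hash-distributed table with more shards than worker nodes, so every node ends up holding several shards of the same table; the table and column names are hypothetical, and the exact UDF names and signatures can differ between CitusDB versions, so check the documentation for your release:

    -- Distribute the events table by hash on customer_id.
    SELECT master_create_distributed_table('events', 'customer_id', 'hash');

    -- Create 16 shards with 2 replicas each; with fewer than 16 worker nodes,
    -- each node holds several shards, which is what lets a single query use
    -- multiple cores on every node.
    SELECT master_create_worker_shards('events', 16, 2);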

Does CitusDB support drivers for connecting to the database? If so, in which languages?

You can use standard PostgreSQL drivers and language bindings with CitusDB, which means almost any language is supported. You can view a list of supported drivers and interfaces for PostgreSQL here.

How does the columnar store extension (cstore_fdw) for PostgreSQL contrast/relate to your core product?

Since each instance in a CitusDB cluster is a nearly vanilla PostgreSQL (9.3), you can simply define the columnar store extension on each instance, and immediately have a scale-out, columnar analytics database for large volumes of data.
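
As a sketch, the statements below define a columnar foreign table on one instance; per the cstore_fdw README, the extension also needs to be added to shared_preload_libraries in postgresql.conf, and the table definition here is only an example:

    -- postgresql.conf on each instance: shared_preload_libraries = 'cstore_fdw'
    CREATE EXTENSION cstore_fdw;
    CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

    -- Example columnar table for append-only event data.
    CREATE FOREIGN TABLE page_views (
        page_id   bigint,
        view_time timestamptz,
        referrer  text
    )
    SERVER cstore_server
    OPTIONS (compression 'pglz');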

Does CitusDB support key-value pairs/semi-structured data?

PostgreSQL 9.3 comes with the hstore data type for storing key-value pairs, and it is supported with CitusDB as well. One important thing to keep in mind while using hstore is that the extension needs to be loaded separately on the master as well as on the worker nodes.
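
As a short example of hstore in practice (the table and keys below are hypothetical):

    -- Run on the master and on every worker node.
    CREATE EXTENSION hstore;

    -- Store arbitrary per-event properties as key-value pairs.
    CREATE TABLE events (
        event_id   bigint,
        properties hstore
    );

    -- Filter on a key-value pair and pull an individual key out of the hstore.
    SELECT event_id, properties -> 'browser' AS browser
    FROM events
    WHERE properties @> 'country => US';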

Running CitusDB

What infrastructure do I need to get started?

Since you can set up multiple instances of CitusDB on a single box, you can get started right away by downloading and installing it.

We have no special expectations from the underlying hardware (such as disk arrays, RAID setups, or InfiniBand connections), which means you can run CitusDB on your commodity hardware, or get instances in the cloud.

The number of machines you need for production depends on (1) how much data you have and (2) what performance you need out of your queries (the configuration of each machine is also relevant of course). CitusDB allows you to get linear performance improvements for a given amount of data by adding more nodes, so you choose where you want to be on the performance vs. cluster size curve.

How can I guard against master node failures?

You can use PostgreSQL's streaming replication feature to replicate the master node's data in real-time. If the master fails, one of its slaves can then take over from where the master left off. For details on setting this up, please refer to the PostgreSQL wiki.
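
Once replication is configured, you can check on it from psql using the standard PostgreSQL monitoring views, which apply to the CitusDB master as well:

    -- On the master: list connected standbys and their replication progress.
    SELECT client_addr, state, sent_location, replay_location
    FROM pg_stat_replication;

    -- On the standby: confirm it is running in recovery (following the master).
    SELECT pg_is_in_recovery();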

Are there any cluster monitoring tools which I can use with CitusDB?

For cluster monitoring, CitusDB can integrate with existing frameworks such as Ganglia, Nagios, or Sensu, since many monitoring frameworks have integration points with PostgreSQL. Here are a few useful links to learn more about these technologies:

  • http://sensuapp.org/
  • https://github.com/sensu/sensu-community-plugins/tree/master/plugins/postgres
  • http://www.nagios.com/solutions/postgres-monitoring
  • http://exchange.nagios.org/directory/Plugins/Databases/PostgresQL
  • https://github.com/ganglia/gmetric/tree/master/database

I am trying to run approximate count(distinct) which requires installing the hyperloglog extension. How do I go about it?

CitusDB uses the HyperLogLog algorithm to calculate approximate values for count distinct queries. To enable distinct approximations, you will need to do the following:

  1. Install the latest citusdb-contrib package on every node. For example, if you are running Fedora on a 64-bit machine, install the corresponding 64-bit package on all the nodes in the cluster.
  2. Create the hll extension on every node.
  3. Enable count distinct approximations by setting the count_distinct_error_rate configuration value. We recommend setting this to 0.005. Lower values for this parameter give more accurate results. A short sketch of steps 2 and 3 follows this list.
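
The sketch below shows steps 2 and 3 in SQL. The configuration prefix shown here (citusdb.) is an assumption that can vary by version, and the query at the end uses a hypothetical events table:

    -- Step 2: create the extension on the master and on every worker node.
    CREATE EXTENSION hll;

    -- Step 3: enable approximate count(distinct). The GUC prefix is an
    -- assumption; check your version's documentation for the exact name.
    -- Persist it in postgresql.conf on the master, or set it per session.
    SET citusdb.count_distinct_error_rate = 0.005;

    -- Queries like this are then answered with HyperLogLog approximations.
    SELECT count(DISTINCT customer_id) FROM events;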

What if I install as root and the installer can't determine the current user?

The installation will complete normally. Our installer will also create a new postgres user in that case, and will make that user own the database directories. To start up the database, you then need to switch to the postgres user by typing "su - postgres".

Common Errors

When I try to stage data, why do I get the error "could not connect to server: No route to host"?

You may have a firewall configured on your Linux instances. You can temporarily disable this firewall for testing purposes by running the command "sudo service iptables stop".

On RHEL 5.x, why does psql fail to find the symbols defined in the libreadline shared object?

On RHEL 5.x distributions, libreadline's symbols are provided by libncurses. You therefore need to preload this library explicitly by running the command "export LD_PRELOAD=/usr/lib/libncurses.so".