CitusDB is an analytics database that modifies and extends PostgreSQL for scalability. Users talk to Citus DB's master node as they do with a regular database; and the master node partitions the data and queries across worker nodes in the cluster. The specifics of the underlying architecture closely resemble those of Hadoop.
In other words, CitusDB combines the SQL expressiveness and performance of relational databases with the scalability and availability of Hadoop, in a single, uniform product.
CitusDB scales out both longer running analytics queries and short requests such as real-time lookups/inserts/updates, but does not provide full traditional transactional semantics that Postgres does. That makes it is a great fit for ad-hoc, real-time analytics on timeseries data, but not ideal as the transactional backbone of your credit card processing system.
CitusDB outperforms purpose-built analytics appliances by more than 10x. The graph below uses the industry standard TPC-H benchmark, and compares the performance of a CitusDB cluster running on 100 EC2 instances to a dedicated analytics appliance.
CitusDB is optimized for performing ad-hoc analysis, standard reporting, and data exploration on your historic event data.
Data that has a natural temporal ordering (e.g. user actions, event data, text-based log files, machine generated data, clickstreams, ad impressions) and that grows rapidly is particularly well suited to CitusDB.
On these data, you can ask questions like:
Your queries can involve a particular range, join multiple tables together, filter based on complex selection criteria, group and sort results, perform aggregations, and execute other standard analytic functions.
CitusDB isn’t a fork of PostgreSQL; it simply extends Postgres to support distributed SQL queries. Also, CitusDB version numbers are aligned with major version upgrades in Postgres; CitusDB v2.0 was based on PostgreSQL 9.2, v3.0 is based on PostgreSQL 9.3 and v4.0 is based on PostgreSQL 9.4.
On top of PostgreSQL, CitusDB comes with its own replication, distributed query planner and executor logic which enable execution of distributed SQL queries in parallel. This adds Hadoop-like fault tolerance, scalability and recovery from mid-query failures to CitusDB.
CitusDB stores data in an extended PostgreSQL database; and therefore provides the SQL expressiveness and core performance benefits of databases (indexes, join optimizations, etc.) that are not available in Hadoop.
The database also enables real-time responsiveness. Simple queries can take as little as 100ms, and complex aggregations over large data sets complete within seconds.
CitusDB is built from the start with true parallelism in mind; and can efficiently scale to 100s of nodes. Its software-only architecture allows it to run anywhere: on-premise or in the cloud, without any specific expectations from hardware.
CitusDB supports visualization tools like Tableau through standard Postgres ODBC/JDBC drivers. Any other BI tools which use standard Postgres drivers can be used with CitusDB. One common issue people run into while using BI tools with CitusDB is that CitusDB does not currently support PREPARE statements. To get rid of this error, you need to configure your ODBC/JDBC driver to change the protocol version and avoid using PREPARE statements.
A single CitusDB node stores multiple shards of the same distributed table. This enables CitusDB to use multiple cores for a single query by virtue of hitting multiple PostgreSQL tables (shards) on each node. However, to get true scalability in performance and reliability, we recommend a multi-node cluster. In cases where queries hit the disk, a single node setup can easily become disk I/O bound.
You can use standard PostgreSQL drivers and language bindings with CitusDB, which means almost any language is supported. You can view a list of supported drivers and interfaces for PostgreSQL here.
Since each instance in a CitusDB cluster is a nearly vanilla PostgreSQL (9.4), you can simply define the columnar store extension on each instance, and immediately have a scale-out, columnar analytics database for large volumes of data.
PostgreSQL 9.4 comes with the hstore data type for storing key-value pairs, and it is supported with CitusDB as well. One important thing to keep in mind while using the hstore extension is that the extension needs to be loaded separately on the master as well as the worker nodes.
Please see our updated documentation for this FAQ.
Please see our updated documentation for a list of common errors and their solutions.