CitusDB is an analytics database that modifies and extends PostgreSQL for scalability. Users talk to CitusDB's master node as they would to a regular database, and the master node partitions the data and queries across the worker nodes in the cluster. The specifics of the underlying architecture closely resemble those of Hadoop.
In other words, CitusDB combines the SQL expressiveness and performance of relational databases with the scalability and availability of Hadoop, in a single, uniform product.
CitusDB scales out both longer-running analytics queries and short requests such as real-time lookups, inserts, and updates, but it does not provide the full traditional transactional semantics that Postgres does. That makes it a great fit for ad-hoc, real-time analytics on time-series data, but not ideal as the transactional backbone of your credit card processing system.
CitusDB outperforms purpose-built analytics appliances by more than 10x. The graph below uses the industry-standard TPC-H benchmark and compares the performance of a CitusDB cluster running on 100 EC2 instances to that of a dedicated analytics appliance.
CitusDB is optimized for performing ad-hoc analysis, standard reporting, and data exploration on your append-only event data, but not for modifying this data in real time.
Data that has a natural temporal ordering (e.g. user actions, event data, text-based log files, machine generated data, clickstreams, ad impressions) and that grows rapidly is particularly well suited to CitusDB.
On these data, you can ask questions like:
Your queries can involve a particular range, join multiple tables together, filter based on complex selection criteria, group and sort results, perform aggregations, and execute other standard analytic functions.
CitusDB stores data in an extended PostgreSQL database and therefore provides the SQL expressiveness and core performance benefits of databases (indexes, join optimizations, etc.) that are not available in Hadoop.
The database also enables real-time responsiveness. Simple queries can take as little as 100ms, and complex aggregations over large data sets complete within seconds.
CitusDB was built from the start with true parallelism in mind and can efficiently scale to hundreds of nodes. Its software-only architecture allows it to run anywhere, on premises or in the cloud, without any specific hardware requirements.
A single CitusDB node stores multiple shards of the same distributed table. This enables CitusDB to use multiple cores for a single query, because the query hits multiple PostgreSQL tables (shards) on each node. However, to get true scalability in performance and reliability, we recommend a multi-node cluster. In cases where queries hit the disk, a single-node setup can easily become disk I/O bound.
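The way one query can be spread over several shards and then combined can be sketched in a few lines of Python. This is an illustration only, not CitusDB code: the sample rows, the round-robin shard placement, and all function names are assumptions, and Python threads merely mimic the fan-out rather than providing real multi-core execution.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical event rows: (event_day, bytes_sent). Not CitusDB data structures.
ROWS = [("2014-01-0%d" % (i % 3 + 1), i) for i in range(1, 13)]


def split_into_shards(rows, shard_count):
    """Deal rows out round-robin; a stand-in for real shard placement."""
    shards = [[] for _ in range(shard_count)]
    for i, row in enumerate(rows):
        shards[i % shard_count].append(row)
    return shards


def partial_sum_by_day(shard):
    """Work done against one shard: a local GROUP BY day, SUM(bytes_sent)."""
    totals = defaultdict(int)
    for day, value in shard:
        totals[day] += value
    return dict(totals)


def run_query(rows, shard_count=4):
    """Fan the per-shard work out, then merge the partial results."""
    shards = split_into_shards(rows, shard_count)
    with ThreadPoolExecutor(max_workers=shard_count) as pool:
        partials = list(pool.map(partial_sum_by_day, shards))
    merged = defaultdict(int)
    for partial in partials:
        for day, value in partial.items():
            merged[day] += value
    return dict(merged)
```

Because sums can be merged from partial sums, the result does not depend on how many shards the rows were split into; only the amount of work that can proceed side by side changes.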
You can use standard PostgreSQL drivers and language bindings with CitusDB, which means almost any language is supported. You can view a list of supported drivers and interfaces for PostgreSQL here.
Since each instance in a CitusDB cluster is a nearly vanilla PostgreSQL (9.3) server, you can simply define the columnar store extension on each instance and immediately have a scale-out, columnar analytics database for large volumes of data.
PostgreSQL 9.3 comes with the hstore data type for storing key-value pairs, and CitusDB supports it as well. One important thing to keep in mind when using the hstore extension is that it needs to be loaded separately on the master node and on each of the worker nodes.
Since you can set up multiple instances of CitusDB on a single box, you can get started right away by downloading and installing it.
We have no special requirements for the underlying hardware (such as disk arrays, RAID setups, or InfiniBand connections), which means you can use CitusDB on your commodity hardware or on instances in the cloud.
The number of machines you need for production depends on (1) how much data you have and (2) what performance you need out of your queries (the configuration of each machine is also relevant of course). CitusDB allows you to get linear performance improvements for a given amount of data by adding more nodes, so you choose where you want to be on the performance vs. cluster size curve.
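As a back-of-the-envelope aid, the trade-off between performance and cluster size can be written down as simple arithmetic. The sketch below assumes ideal linear scaling for a fixed amount of data, as described above; the function names and the sample numbers are hypothetical, and real clusters should be sized from measurements.

```python
import math


def estimated_seconds(single_node_seconds, nodes):
    """Query time on `nodes` machines, assuming ideal linear scaling."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return single_node_seconds / nodes


def nodes_needed(single_node_seconds, target_seconds):
    """Smallest cluster size that meets the target under the same assumption."""
    if target_seconds <= 0:
        raise ValueError("target_seconds must be positive")
    return max(1, math.ceil(single_node_seconds / target_seconds))
```

For example, under this assumption a query that takes 120 seconds on one node would take about 15 seconds on 8 nodes, and a 10-second target would call for 12 nodes.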
You can use PostgreSQL's streaming replication feature to replicate the master node's data in real-time. If the master fails, one of its slaves can then take over from where the master left off. For details on setting this up, please refer to the PostgreSQL wiki.
For cluster monitoring, CitusDB can integrate with existing frameworks such as Ganglia, Nagios, or Sensu. This is possible because many monitoring frameworks have integration points with PostgreSQL. Here are a few useful links to learn more about these technologies:
CitusDB uses the HyperLogLog algorithm to calculate approximate values for count distinct queries. To enable distinct approximations, you will need to do the following:
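Separately from the setup steps, the idea behind HyperLogLog can be illustrated with a minimal Python sketch. It is not CitusDB's implementation: the class name, the default of 2**12 registers, and the use of SHA-1 as the hash function are assumptions made for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant from the original paper, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        # Take 64 bits of a hash of the value.
        digest = hashlib.sha1(str(value).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)             # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining 64 - p bits
        # Rank: position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        # Sketches built separately combine by taking register-wise maxima.
        self.registers = [max(a, b)
                          for a, b in zip(self.registers, other.registers)]

    def count(self):
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw
```

The sketch uses a fixed amount of memory regardless of how many values are added, duplicates never change the registers, and sketches built on separate machines can be merged, which is what makes the algorithm attractive for approximate count distinct over partitioned data. With 2**12 registers the expected relative error is roughly 1.6%.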
The installation will complete normally. In that case, our installer will also create a new postgres user and make that user the owner of the database directories. To start up the database, you then need to switch to the postgres user by typing "su - postgres".
You may have a firewall configured on your Linux instances. You may temporarily disable this firewall for testing purposes by running the command "sudo service iptables stop".
On RHEL 5.x distributions, libreadline's symbols are provided by libncurses. You therefore need to explicitly specify this library by running the command "export LD_PRELOAD=/usr/lib/libncurses.so".