FAQs About Citus Data Software Solutions

What is Citus Data about?

CitusDB is an analytics database that modifies and extends PostgreSQL for scalability. Users talk to CitusDB's master node as they would to a regular database, and the master node partitions the data and the queries across the worker nodes in the cluster. The specifics of the underlying architecture closely resemble those of Hadoop.

In other words, CitusDB combines the SQL expressiveness and performance of relational databases with the scalability and availability of Hadoop, in a single, uniform product.
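As a rough illustration of the master/worker split described above, the sketch below partitions rows into shards, places the shards on workers, and answers a count by summing per-shard partial results. It is a toy model, not CitusDB's implementation: the hash-based partitioning, the round-robin placement, the worker names, and the `user_id` partition key are all assumptions made for the example.

```python
import zlib
from collections import defaultdict

# Hypothetical worker names; a real cluster would list its own nodes.
WORKERS = ["worker-1", "worker-2", "worker-3"]


def shard_for(key: str, shard_count: int) -> int:
    """Map a partition key to a shard id using a stable hash (assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % shard_count


def distribute(rows, shard_count, workers):
    """Group rows by shard and assign each shard to a worker round-robin."""
    placement = {shard: workers[shard % len(workers)] for shard in range(shard_count)}
    shards = defaultdict(list)
    for row in rows:
        shards[shard_for(row["user_id"], shard_count)].append(row)
    return placement, shards


def count_events(shards):
    """Scatter: each shard yields a partial count. Gather: the master sums them."""
    partials = [len(shard_rows) for shard_rows in shards.values()]
    return sum(partials)


rows = [{"user_id": f"user-{i}", "event": "click"} for i in range(1000)]
placement, shards = distribute(rows, shard_count=8, workers=WORKERS)
total = count_events(shards)
```

The point of the sketch is that the client only ever sees `total`; which shard or worker held a given row stays hidden behind the master.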

When should I use CitusDB?

  • You want flexible access to your historic data (user actions, event streams, text logs, machine-generated data) through SQL.
  • Parts of your data are growing at a pace that prohibits putting them into an expensive data warehouse.
  • You want to be able to scale your analytics, regardless of your data volume.
  • You want your customers to access their historic data with real-time responsiveness: complex aggregations over very large data sets should return within seconds, and key-value lookups in under a second.

When should I not use CitusDB?

CitusDB scales out both longer-running analytics queries and short requests such as real-time lookups, inserts, and updates, but it does not provide the full traditional transactional semantics that Postgres does. That makes it a great fit for ad-hoc, real-time analytics on time-series data, but not ideal as the transactional backbone of your credit card processing system.

How fast is CitusDB?

CitusDB outperforms purpose-built analytics appliances by more than 10x. The graph below uses the industry-standard TPC-H benchmark and compares the performance of a CitusDB cluster running on 100 EC2 instances with that of a dedicated analytics appliance.

What use-cases and data sets are best suited to CitusDB?

CitusDB is optimized for performing ad-hoc analysis, standard reporting, and data exploration on your historic event data.

Data that has a natural temporal ordering (e.g. user actions, event data, text-based log files, machine-generated data, clickstreams, ad impressions) and that grows rapidly is particularly well suited to CitusDB.

On these data, you can ask questions like:

  • Who are my most valuable/engaged customers, based on the activities they perform on my site?
  • Which group of users clicks on a given category of ads most often?
  • As an advertiser, what are all the sites I’ve published this specific ad on?
  • Is my CampaignA traffic converting better than my CampaignB traffic?
  • How many users in each age group used our newly released feature last month?
  • What are all the IP addresses involved with this specific event between these two dates?

Your queries can involve a particular range, join multiple tables together, filter based on complex selection criteria, group and sort results, perform aggregations, and execute other standard analytic functions.
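To make one of the questions above concrete ("How many users in each age group used our newly released feature last month?"), here is a small sketch of the kind of query involved: a date range, a join, a filter, a grouping, and an aggregate. It runs against an in-memory SQLite database purely so the example is self-contained; the table names, columns, and sample rows are hypothetical, and the sketch does not establish which of these constructs a given CitusDB version supports on distributed tables.

```python
import sqlite3

# In-memory SQLite stands in for a real database so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, age_group TEXT);
CREATE TABLE events (user_id INTEGER, feature TEXT, event_date TEXT);
INSERT INTO users VALUES (1, '18-24'), (2, '18-24'), (3, '25-34');
INSERT INTO events VALUES
    (1, 'new_feature', '2015-03-02'),
    (1, 'new_feature', '2015-03-09'),
    (2, 'new_feature', '2015-03-15'),
    (3, 'new_feature', '2015-03-20'),
    (3, 'new_feature', '2015-04-01'),
    (2, 'old_feature', '2015-03-11');
""")

# Range filter + join + selection criteria + grouping + aggregation.
query = """
SELECT u.age_group, COUNT(DISTINCT e.user_id) AS users
FROM events e
JOIN users u ON u.user_id = e.user_id
WHERE e.feature = 'new_feature'
  AND e.event_date >= '2015-03-01' AND e.event_date < '2015-04-01'
GROUP BY u.age_group
ORDER BY u.age_group
"""
result = conn.execute(query).fetchall()
```

With the sample rows above, `result` holds one row per age group with the number of distinct users who touched the feature in March.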

How different is CitusDB from PostgreSQL?

CitusDB isn’t a fork of PostgreSQL; it simply extends Postgres to support distributed SQL queries. Also, CitusDB version numbers are aligned with major version upgrades in Postgres; CitusDB v2.0 was based on PostgreSQL 9.2, v3.0 is based on PostgreSQL 9.3 and v4.0 is based on PostgreSQL 9.4.

On top of PostgreSQL, CitusDB comes with its own replication, distributed query planner, and executor logic, which together enable execution of distributed SQL queries in parallel. This adds Hadoop-like fault tolerance, scalability, and recovery from mid-query failures to CitusDB.

How is CitusDB different from Hadoop?

CitusDB stores data in an extended PostgreSQL database, and therefore provides the SQL expressiveness and core performance benefits of databases (indexes, join optimizations, etc.) that are not available in Hadoop.

The database also enables real-time responsiveness. Simple queries can take as little as 100ms, and complex aggregations over large data sets complete within seconds.

How is CitusDB different from other analytics databases?

CitusDB was built from the start with true parallelism in mind, and it can efficiently scale to hundreds of nodes. Its software-only architecture allows it to run anywhere, on-premises or in the cloud, without any specific hardware requirements.

We know the devil is not in the claims we make, but in the details. Watch our video to see CitusDB in action, and then get started with our sample data sets or your own data.

Can I use BI tools like Tableau with CitusDB?

CitusDB supports visualization tools like Tableau through standard Postgres ODBC/JDBC drivers. Any other BI tool that uses standard Postgres drivers can be used with CitusDB as well. One common issue people run into while using BI tools with CitusDB is that CitusDB does not currently support PREPARE statements, so a tool that issues them receives an error. To avoid this error, configure your ODBC/JDBC driver to change the protocol version and to avoid using PREPARE statements.

Is CitusDB able to scale to multiple cores of parallel query execution on the same node, or does that require multiple nodes?

A single CitusDB node stores multiple shards of the same distributed table. This enables CitusDB to use multiple cores for a single query by virtue of hitting multiple PostgreSQL tables (shards) on each node. However, to get true scalability in performance and reliability, we recommend a multi-node cluster. In cases where queries hit the disk, a single node setup can easily become disk I/O bound.
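The idea of using several cores by querying several shards at once can be sketched as follows. Each shard gets its own task, the tasks run concurrently, and the partial results are combined at the end. This is only an illustration of the shape of the computation: the shard names and values are made up, and Python threads stand in for what would, in a real cluster, be separate PostgreSQL connections that the operating system can schedule on different cores.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards of one distributed table, all stored on a single node.
shards = {
    "events_shard_1": [3, 5, 8],
    "events_shard_2": [13, 21],
    "events_shard_3": [34, 55, 89],
    "events_shard_4": [144],
}


def partial_sum(values):
    """Stand-in for running SUM(...) against one shard table."""
    return sum(values)


def parallel_sum(shards, max_workers=4):
    """Run one task per shard concurrently, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(partial_sum, shards.values()))
    return sum(partials)


total = parallel_sum(shards)
```

With more shards than cores, or with queries that read from disk, the tasks start competing for the same resources, which is the single-node limit the answer above describes.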

Does CitusDB support drivers for connecting to the database? If so, in which languages?

You can use standard PostgreSQL drivers and language bindings with CitusDB, which means almost any language is supported. You can view a list of supported drivers and interfaces for PostgreSQL here.

How does the columnar store extension (cstore_fdw) for PostgreSQL contrast/relate to your core product?

Since each instance in a CitusDB cluster is a nearly vanilla PostgreSQL (9.4), you can simply define the columnar store extension on each instance, and immediately have a scale-out, columnar analytics database for large volumes of data.

Does CitusDB support key-value pairs/semi-structured data?

PostgreSQL 9.4 comes with the hstore data type for storing key-value pairs, and it is supported with CitusDB as well. One important thing to keep in mind while using the hstore extension is that the extension needs to be loaded separately on the master as well as the worker nodes.
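Because the extension has to be loaded on every node separately, it can help to generate the same command once per node. The sketch below builds one `psql` invocation for the master and each worker; it only produces the command strings and does not run them. The host names, ports, and database name are hypothetical placeholders, and whether you run the commands by hand or from a script is up to you.

```python
import shlex

# Hypothetical cluster layout; replace with your own master and worker nodes.
NODES = [
    ("master.example.com", 5432),
    ("worker-1.example.com", 5432),
    ("worker-2.example.com", 5432),
]


def extension_commands(nodes, database, extension="hstore"):
    """Build one psql command per node that loads the given extension."""
    sql = f"CREATE EXTENSION IF NOT EXISTS {extension};"
    return [
        f"psql -h {shlex.quote(host)} -p {port} "
        f"-d {shlex.quote(database)} -c {shlex.quote(sql)}"
        for host, port in nodes
    ]


commands = extension_commands(NODES, "postgres")
for command in commands:
    print(command)
```

`IF NOT EXISTS` makes the statement safe to repeat, so the same list can be re-run after a new worker is added.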

Running CitusDB

Please see our updated documentation for this FAQ.

Common Errors

Please see our updated documentation for a list of common errors and their solutions.