Blog posts tagged with 'performance' on the Citus Blog - Page 4

ZFS Private Beta on Citus Cloud

Written byBy Craig Kerstiens | July 19, 2018Jul 19, 2018

ZFS is a open source file system with the option to store data on disk in a compressed form. Itself ZFS supports a number of compression algorithms, giving you flexibility to optimize both performance and how much you store on disk. Compressing your data on disk offers two pretty straightforward advantages:

Reduce the amount of storage you need—thus reducing costs
When reading from disk, requires less data to be scanned, improving performance

To date, we have run Citus Cloud—our fully-managed database as a service that scales out Postgres horizontally—in production on EXT4. Today, we're excited to announce a limited beta program of ZFS support for our Citus Cloud database. ZFS makes Citus Cloud even more powerful for certain use cases. If you are interested in access to the beta contact us to get more info, or continue reading to learn more about the use cases where ZFS and Citus and Postgres can help.

Keep reading

Scalable incremental data aggregation on Postgres and Citus

Written byBy Marco Slot | June 14, 2018Jun 14, 2018

Many companies generate large volumes of time series data from events happening in their application. It’s often useful to have a real-time analytics dashboard to spot trends and changes as they happen. You can build a real-time analytics dashboard on Postgres by constructing a simple pipeline:

Load events into a raw data table in batches
Periodically aggregate new events into a rollup table
Select from the rollup table in the dashboard

For large data streams, Citus (an open source extension to Postgres that scales out Postgres horizontally) can scale out each of these steps across all the cores in a cluster of Postgres nodes.

One of the challenges of maintaining a rollup table is tracking which events have already been aggregated—so you can make sure that each event is aggregated exactly once. A common technique to ensure exactly-once aggregation is to run the aggregation for a particular time period after that time period is over. We often recommend aggregating at the end of the time period for its simplicity, but you cannot provide any results before the time period is over and backfilling is complicated.

Keep reading

Citus what is it good for? OLTP? OLAP? HTAP?

Written byBy Craig Kerstiens | June 7, 2018Jun 7, 2018

Earlier this week as I was waiting to begin a talk at a conference, I chatted with someone in the audience that had a few questions. They led off with this question: is Citus a good fit for X? The heart of what they were looking to figure out: is the Citus distributed database a better fit for analytical (data warehousing) workloads, or for more transactional workloads, to power applications? We hear this question quite a lot, so I thought I'd elaborate more on the use cases that make sense for Citus from a technical perspective.

Before I dig in, if you're not familiar with Citus; we transform Postgres into a distributed database that allows you to scale your Postgres database horizontally. Under the covers, your data is sharded across multiple nodes, meanwhile things still appear as a single node to your application. By appearing still like a single node database, your application doesn't need to know about the sharding. We do this as a pure extension to Postgres, which means you get all the power and flexibility that's included within Postgres such as JSONB, PostGIS, rich indexing, and more.

Keep reading

Faster bulk loading in Postgres with copy

Written byBy Craig Kerstiens | November 8, 2017Nov 8, 2017

If you've used a relational database, you understand basic INSERT statements. Even if you come from a NoSQL background, you likely grok inserts. Within the Postgres world, there is a utility that is useful for fast bulk ingestion: \copy. Postgres \copy is a mechanism for you to bulk load data in or out of Postgres.

First, lets pause. Do you even need to bulk load data and what's it have to do with Citus? We see customers leverage Citus for a number of different uses. When looking at Citus for a transactional workload, say as the system of record for some kind of multi-tenant SaaS app, your app is mostly performing standard insert/updates/deletes.

But when you're leveraging Citus for real-time analytics, you may already have a separate ingest pipeline. In this case you might be looking at event data, which can be higher in volume than a standard web app workload. If you already have an ingest pipeline that reads off Apache Kafka or Kinesis, you could be a great candidate for bulk ingest.

Back to our feature presentation: Postgres \copy. Copy is interesting because you can achieve much higher throughput than with single row inserts.

Keep reading

Fermi Estimates On Postgres Performance

Written byBy Samay Sharma | September 29, 2017Sep 29, 2017

A lot of people look to Citus for a solution that scales out their Postgres database, whether on-prem or as open source or in the cloud, as a fully-managed database as a service. And yet, a common question even before looking at Citus is: "what kind of performance can I get with Postgres?" The answer is: it depends. The performance you can expect from single node Postgres comes down to your workload, both on inserts and on the query side and how large that single node is. Unfortunately, "it depends" often leaves people a bit dissatisfied.

Fortunately, there are some fermi estimates, or in laymans terms ballpark, of what performance single node Postgres can deliver. These ballparks apply both to single-node Postgres, but from there you can start to get estimates of how much further you can go when scaling out with Citus. Let's walk through a simplified guide for what you should expect in terms of the read performance and ingest performance for queries in Postgres.

Keep reading

Postgres Parallel indexing in Citus

Written byBy Marco Slot | January 17, 2017Jan 17, 2017

Indexes are an essential tool for optimizing database performance and are becoming ever more important with big data. However, as the volume of data increases, index maintenance often becomes a write bottleneck, especially for advanced index types which use a lot of CPU time for every row that gets written. Index creation may also become prohibitively expensive as it may take hours or even days to build a new index on terabytes of data in postgres. As of Citus 6.0, we’ve made creating and maintaining indexes that much faster through parallelization.

Keep reading

Real-time event aggregation at scale using Postgres w/ Citus

Written byBy Marco Slot | November 29, 2016Nov 29, 2016

Citus is commonly used to scale out event data pipelines on top of PostgreSQL. Its ability to transparently shard data and parallelise queries over many machines makes it possible to have real-time responsiveness even with terabytes of data. Users with very high data volumes often store pre-aggregated data to avoid the cost of processing raw data at run-time. With Citus 6.0 this type of workflow became even easier using a new feature that enables pre-aggregation inside the database in a massively parallel fashion using standard SQL. For large datasets, querying pre-computed aggregation tables can be orders of magnitude faster than querying the facts table on demand.

Keep reading

COPY into distributed PostgreSQL tables, up to ~7M rows/sec

Written byBy Marco Slot | June 15, 2016Jun 15, 2016

In the recent 5.1 release of the Citus extension for PostgreSQL, we added the ability to use the COPY command to load data into distributed tables. PostgreSQL's COPY command is one of the most powerful bulk loading features of any database and it is even more powerful on distributed tables.

Keep reading

Interactive Analytics on GitHub Data using PostgreSQL with Citus

Written byBy Marco Slot | March 23, 2016Mar 23, 2016

With the Citus extension, you can use PostgreSQL to build applications that interactively query large data sets, for example analytical dashboards. At the same time, you can keep adding new data at high rates and use PostgreSQL's powerful indexing...

Keep reading

Testing Postgres vectorization for faster aggregations

Written byBy Umur Cubukcu | October 7, 2014Oct 7, 2014

One of the ideas we wanted to explore more has been speeding up in-memory aggregations in PostgreSQL through vectorized execution. The opportunity to do so came up when we had our intern, Can, take this on as a project during his summer internship...

Keep reading