The Citus Blog | Citus Data

ZFS Private Beta on Citus Cloud

Written by Craig Kerstiens | July 19, 2018

ZFS is an open source file system with the option to store data on disk in a compressed form. ZFS itself supports a number of compression algorithms, giving you the flexibility to optimize both performance and how much you store on disk. Compressing your data on disk offers two pretty straightforward advantages:

It reduces the amount of storage you need, which reduces costs
It requires less data to be scanned when reading from disk, which improves performance
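
As a rough illustration of the storage advantage, the sketch below compresses a block of repetitive, row-like data in memory. It is only an analogy: `zlib` (DEFLATE) stands in for whichever algorithm ZFS would be configured with, the sample rows are made up, and the ratio you see on real data depends heavily on how repetitive that data is.

```python
import zlib

# Hypothetical sample: 10,000 short, repetitive rows, similar in shape to event data.
rows = "".join(f"{i},signup,2018-07-19\n" for i in range(10_000)).encode()

# Compress the whole block, as a stand-in for on-disk compression.
compressed = zlib.compress(rows)
ratio = len(rows) / len(compressed)

print(f"raw: {len(rows)} bytes, compressed: {len(compressed)} bytes, ratio: {ratio:.1f}x")
```

Fewer bytes stored also means fewer bytes to read back, which is where the second advantage comes from, at the price of some CPU time spent decompressing.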

To date, we have run Citus Cloud, our fully-managed database as a service that scales out Postgres horizontally, in production on EXT4. Today, we're excited to announce a limited beta program of ZFS support for our Citus Cloud database. ZFS makes Citus Cloud even more powerful for certain use cases. If you are interested in access to the beta, contact us to get more info, or continue reading to learn more about the use cases where ZFS, Citus, and Postgres can help.

Keep reading

Using search_path and views to hide columns for reporting with Postgres

Written by Sai Srirampur | July 3, 2018

Data security and data privacy are important; no one disputes that. We all want to keep private things private and to keep our data secure. And yet, data needs to be shared, to enable insights, to help organizations observe patterns and have those “ah-ha” moments. None of us want the extreme where, in an effort to keep data secure, no one in your organization has access to data in any form, and the result is no business insights or analytics. With GDPR going into effect, you've likely been rethinking what security controls you have in place.

Here at Citus Data we collaborate with SaaS businesses and larger enterprises alike, generally to consult on Postgres data models and how to best scale out their database. (Our Citus extension to Postgres enables you to scale out Postgres horizontally. The benefit: performance.) In working with teams, one common thing we've seen companies do is restrict who can see which bits of Personally Identifiable Information (PII) within their database. There are a number of approaches, including heavyweight ETL processes that mask PII bits. An ETL process tends to introduce a certain amount of latency between the time data lands in your system and the time it can be analyzed.

Fortunately, Postgres provides a few primitives that can be used directly within your database to hide PII, while still enabling sophisticated analytics and exploration of data in real time.

Here we'll look at using Postgres schemas and views to provide access to data while keeping PII safe and hidden.
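
To make the idea concrete, here is a minimal sketch of hiding a PII column behind a view. It runs against an in-memory SQLite database purely so that it is self-contained; the table, columns, and view name are hypothetical. In Postgres the view would presumably live in its own schema, listed ahead of the base table's schema on the reporting role's search_path. SQLite has no search_path, so this covers only the view half of the approach.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical base table holding a PII column (email).
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, plan TEXT);
    INSERT INTO users (email, plan) VALUES
        ('ada@example.com', 'pro'),
        ('bob@example.com', 'free');

    -- The reporting view exposes only the non-PII columns.
    CREATE VIEW users_reporting AS SELECT id, plan FROM users;
""")

cur = conn.execute("SELECT * FROM users_reporting ORDER BY id")
view_columns = [d[0] for d in cur.description]
view_rows = cur.fetchall()

print(view_columns)
print(view_rows)
```

Analysts who query `users_reporting` can still count and group by plan, but the `email` column never appears in their result sets.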

Keep reading

Options for scaling from 1 to 100,000 tenants

Written by Craig Kerstiens | June 28, 2018

When you first start out building a SaaS application, you talk about that day in the future when you will have scaling problems, how that'll be the day, how that would be a good problem to have. You focus on getting the first few customers, making sure they have a great experience, and suddenly you're at 10s of customers, then 100s. You've upgraded your app server to a larger one, then you've gone from one EC2 app server to multiple ones with ELB in front of things. You've upgraded your Postgres database from an r3.large on AWS to an r3.xlarge, and now you're eyeing that r3.2xlarge next month. In the back of your mind, though, you're starting to look at your plans for future growth of your SaaS app, and you're wondering how much larger you can keep going. Your database is performing well at 100 tenants (tenants = customers), and your back-of-the-napkin math says you'll be able to scale your app up to 1,000 tenants, but after that you know you're going to have to explore some options.

What are those options and what are the trade-offs and benefits?

Keep reading

Fun with SQL: Functions in Postgres

Written by Craig Kerstiens | June 21, 2018

In our previous Fun with SQL post on the Citus Data blog, we covered window functions. Window functions are a special class of function that allow you to grab values across rows and then perform some logic. By jumping ahead to window functions, we missed so many of the other handy functions that exist within Postgres natively. There are in fact several hundred built-in functions. And when you need something custom, you can also create your own user-defined functions (UDFs). Today we're going to walk through just a small sampling of SQL functions that can be extremely handy in PostgreSQL.

Keep reading

How do you pronounce Citus?

Written by Craig Kerstiens | June 19, 2018

It’s a common question we get at conferences, on calls, in meetings. “Citrus”, “Citius”, “Citus”, is that how you pronounce it? The quick and short of it is, we’re not named after a fruit. You pronounce it like “site-us”.

Most tend to leave it there, without wondering further. But a few do inquire as to the meaning. Citus’s name comes from the Olympic motto “Citius, Altius, Fortius”, which is Latin for “Faster, Higher, Stronger.” Our goal for the Citus extension is to be fast for both transactional and analytical workloads.

Keep reading

Scalable incremental data aggregation on Postgres and Citus

Written by Marco Slot | June 14, 2018

Many companies generate large volumes of time series data from events happening in their application. It’s often useful to have a real-time analytics dashboard to spot trends and changes as they happen. You can build a real-time analytics dashboard on Postgres by constructing a simple pipeline:

Load events into a raw data table in batches
Periodically aggregate new events into a rollup table
Select from the rollup table in the dashboard

For large data streams, Citus (an open source extension to Postgres that scales out Postgres horizontally) can scale out each of these steps across all the cores in a cluster of Postgres nodes.

One of the challenges of maintaining a rollup table is tracking which events have already been aggregated—so you can make sure that each event is aggregated exactly once. A common technique to ensure exactly-once aggregation is to run the aggregation for a particular time period after that time period is over. We often recommend aggregating at the end of the time period for its simplicity, but you cannot provide any results before the time period is over and backfilling is complicated.
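
The three pipeline steps and the end-of-period technique described above can be sketched as follows. This is a minimal illustration rather than the approach the full post develops: it uses an in-memory SQLite database instead of Postgres or Citus, and the table names, the integer `minute` column, and the `aggregate_closed_minutes` helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (minute INTEGER, page TEXT);
    -- The primary key makes an accidental second aggregation of a minute fail loudly.
    CREATE TABLE rollup (minute INTEGER, page TEXT, hits INTEGER,
                         PRIMARY KEY (minute, page));
""")

def load_batch(rows):
    """Step 1: load a batch of raw events."""
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

def aggregate_closed_minutes(last_done, current_minute):
    """Step 2: roll up only minutes that are over, i.e. last_done < minute < current_minute.

    Returns the new high-water mark so each minute is aggregated exactly once.
    """
    conn.execute(
        """
        INSERT INTO rollup
        SELECT minute, page, COUNT(*) FROM events
        WHERE minute > ? AND minute < ?
        GROUP BY minute, page
        """,
        (last_done, current_minute),
    )
    return current_minute - 1

load_batch([(1, "/home"), (1, "/home"), (2, "/pricing"), (3, "/home")])
watermark = aggregate_closed_minutes(0, 3)  # minute 3 is still open, so it is skipped
load_batch([(3, "/home")])                  # a late event for minute 3 arrives
watermark = aggregate_closed_minutes(watermark, 4)

# Step 3: the dashboard reads from the rollup table only.
dashboard = conn.execute("SELECT * FROM rollup ORDER BY minute, page").fetchall()
print(dashboard)
```

The sketch also shows the limitation mentioned above: minute 3 does not appear in the rollup until minute 4 has started, and an event that arrives for an already-aggregated minute would be missed.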

Keep reading

Configuring memory for Postgres

Written by Craig Kerstiens | June 12, 2018

work_mem is perhaps the most confusing setting within Postgres. work_mem is a configuration within Postgres that determines how much memory can be used during certain operations. On its surface, the work_mem setting seems simple: after all, work_mem just specifies the amount of memory available to be used by internal sort operations and hash tables before writing data to disk. And yet, leaving work_mem unconfigured can bring on a host of issues. What perhaps is more troubling, though, is when you receive an out of memory error on your database and you jump in to tune work_mem, only for it to behave in an unintuitive manner.
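
One way to see why the setting can surprise you is a back-of-the-envelope estimate. The sketch below assumes, as the PostgreSQL documentation describes, that the limit applies to each sort or hash operation rather than to a whole query or connection. The helper name and all of the numbers are hypothetical.

```python
def worst_case_work_mem_mb(work_mem_mb, concurrent_queries, ops_per_query):
    """Upper bound on memory used if every sort/hash operation fills work_mem at once."""
    return work_mem_mb * concurrent_queries * ops_per_query

# Hypothetical workload: 64 MB work_mem, 100 concurrent queries,
# each running 3 sort or hash operations.
estimate = worst_case_work_mem_mb(64, 100, 3)
print(f"worst case: {estimate} MB")
```

Under these assumptions a modest-looking 64 MB setting could in the worst case ask for roughly 19 GB, which is why raising work_mem in response to an out of memory error can make matters worse.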

Keep reading

Citus what is it good for? OLTP? OLAP? HTAP?

Written by Craig Kerstiens | June 7, 2018

Earlier this week, as I was waiting to begin a talk at a conference, I chatted with someone in the audience who had a few questions. They led off with this question: is Citus a good fit for X? The heart of what they were looking to figure out: is the Citus distributed database a better fit for analytical (data warehousing) workloads, or for more transactional workloads, to power applications? We hear this question quite a lot, so I thought I'd elaborate more on the use cases that make sense for Citus from a technical perspective.

Before I dig in, in case you're not familiar with Citus: we transform Postgres into a distributed database that allows you to scale your Postgres database horizontally. Under the covers, your data is sharded across multiple nodes, while things still appear as a single node to your application. Because Citus still appears to be a single-node database, your application doesn't need to know about the sharding. We do this as a pure extension to Postgres, which means you get all the power and flexibility that's included within Postgres, such as JSONB, PostGIS, rich indexing, and more.

Keep reading

Fun with SQL: Window functions in Postgres

Written by Craig Kerstiens | June 1, 2018

Today we continue to explore all the powerful and fun things you can do with SQL. SQL is a very expressive language and when it comes to analyzing your data there isn't a better option. You can see the evidence of SQL's power in all the attempts made by NoSQL databases to recreate the capabilities of SQL. So why not just start with a SQL database that scales? (Like my favorites, Postgres and Citus.)

Today, in the latest post in our 'Fun with SQL' series (earlier blog posts were about recursive CTEs, generate_series, and relocating shards on a Citus database cluster), we're going to look at window functions in PostgreSQL. Window functions are key in various analytic and reporting use cases where you want to compare and contrast data. Window functions allow you to compare values between rows that are somehow related to the current row. Some practical uses of window functions can be:

Finding the first time each user performed some action
Finding how much each user's bill increased or decreased from the previous month
Finding where each user ranked within some sub-grouping
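
The second use case, month-over-month change in each user's bill, can be sketched with the `LAG` window function. The example below runs on an in-memory SQLite database (version 3.25 or later, which added window functions) so that it is self-contained; the `bills` table and its data are hypothetical, and the query itself uses only standard window-function syntax that Postgres also supports.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical billing data: one row per user per month.
    CREATE TABLE bills (user_id INTEGER, month TEXT, amount INTEGER);
    INSERT INTO bills VALUES
        (1, '2018-04', 100), (1, '2018-05', 120),
        (2, '2018-04', 50),  (2, '2018-05', 40);
""")

# LAG looks at the previous row within each user's partition, ordered by month.
changes = conn.execute("""
    SELECT user_id, month,
           amount - LAG(amount) OVER (PARTITION BY user_id ORDER BY month) AS delta
    FROM bills
    ORDER BY user_id, month
""").fetchall()

for row in changes:
    print(row)
```

Each user's first month has no previous row, so its `delta` is NULL (`None` in Python). Later months show the increase or decrease relative to the month before.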

Keep reading

Citus 7.4: Move fast and reduce technical debt

Written by Ozgun Erdogan | May 24, 2018

Today, we’re excited to announce the latest release of our distributed database, Citus 7.4! Citus scales out PostgreSQL through sharding, replication, and query parallelization.

Ever since we open sourced Citus as a Postgres extension, we have been incorporating your feedback into our database. Over the past two years, our release cycles went down from six to four to two months. As a result, we have announced 10 new Citus releases, where each release came with notable new features.

Shorter release cycles and more features came at a cost, however. In particular, we added new distributed planner and executor logic to support different use cases for multi-tenant applications and real-time analytics, but we couldn’t find the time to refactor this new logic. We found ourselves accumulating technical debt. Further, our distributed SQL coverage expanded over the past two years. With each year, we ended up spending more and more time on testing each new release.

In Citus 7.4, we focused on reducing technical debt related to these items. At Citus, we track our development velocity with each release. While we fix bugs in every release, we found that a full release focused on addressing technical debt would help to maintain our release velocity. Also, a cleaner codebase leads to a happier and more productive engineering team.

Keep reading