Scaling out PostgreSQL at CloudFlare with CitusDB

CloudFlare is a content delivery network (CDN) and DNS provider that powers millions of websites around the world. Last week, we were happy to see them publish a technical blog post that described how they power their analytics dashboards using CitusDB!

We had three important takeaways from the CloudFlare post. First, CloudFlare has been using PostgreSQL since the early days. They trusted the database, had extensive experience running it, and knew how to use its tools for backups and upgrades. CloudFlare also found two PostgreSQL extensions helpful: hstore for semi-structured data, and HyperLogLog for fast approximate distinct counts. CloudFlare wanted to keep using these Postgres tools and extensions without changing their application layer, while also making PostgreSQL scale. CitusDB enabled them to do just that.
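To make the HyperLogLog idea concrete, here is a minimal Python sketch of the general algorithm. It is an illustration only, not the code of the PostgreSQL extension: the class name, the SHA-1 hash, and the register count are all assumptions chosen for readability. It shows the two properties that matter for dashboards: memory stays fixed no matter how many values are added, and sketches built for separate time intervals can be merged without rescanning the raw data.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog sketch (hypothetical helper, for illustration only)."""

    def __init__(self, p=12):
        self.p = p                    # number of hash bits used as register index
        self.m = 1 << p               # number of registers (4096 for p=12)
        self.registers = [0] * self.m

    def add(self, item):
        # Take 64 bits of a SHA-1 digest as the hash value.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                  # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # The union of two sketches is the element-wise maximum of registers.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)     # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw
```

With 4096 registers the expected relative error is roughly 1.6%, and adding the same value twice never changes the estimate.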

Second, we find that developers are looking to scale out PostgreSQL for different workloads. In CloudFlare's case, they were looking to parallelize their business-facing dashboards in real time. For example, when a CloudFlare business customer wanted to see the number of unique threats to their website over varying time intervals, these graphs had to be rendered in real time. For that, incoming queries had to be parallelized across numerous machines and CPU cores.

Finally, we see different approaches to capturing high-volume event data and presenting insights through dashboards. One approach is to insert raw event data into the database, and have enough hardware to generate query results in real time. A second approach is to capture event data, partition it on a dimension other than time, and then use materialized views to aggregate that data on the time dimension.

Another approach, one that's cost-efficient when more than 5% of the internet's traffic is flowing through your infrastructure, is to use Kafka queues and Go aggregators to capture data at 1-minute granularity in your scale-out PostgreSQL cluster. You can then aggregate that data further within the database.
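The two-stage aggregation in this last approach can be sketched in a few lines of Python. This is a simplified stand-in, not CloudFlare's pipeline: the event shape (a Unix timestamp plus a zone name) and the function names are assumptions made for the example, and in the real architecture the first stage runs in Go aggregators while the second runs inside the database.

```python
from collections import Counter


def aggregate_minutes(events):
    """Stage 1: collapse raw (timestamp, zone) events into 1-minute counts."""
    counts = Counter()
    for ts, zone in events:
        minute_start = ts - ts % 60          # truncate to the start of the minute
        counts[(zone, minute_start)] += 1
    return counts


def rollup_hours(minute_counts):
    """Stage 2: aggregate the 1-minute counts further into hourly counts."""
    hourly = Counter()
    for (zone, minute_start), n in minute_counts.items():
        hour_start = minute_start - minute_start % 3600
        hourly[(zone, hour_start)] += n
    return hourly
```

The point of the design is that the database only ever stores one row per zone per minute, so its write volume depends on the number of zones rather than on raw traffic.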

Figure: CloudFlare Log Processing Infrastructure

The diagram above highlights the last approach, with specific details from CloudFlare's architecture. In practice, our customers who have real-time analytic workloads could be using any one of the three approaches above, thanks to PostgreSQL's flexible architecture!

Also, we're currently revamping our documentation to include these three example architectures for real-time analytics. In the meantime, if you have any questions about using CitusDB, please get in touch with us!

pgDaySF 2015 Takeaways

We recently attended pgDaySF, the one-day PostgreSQL event organized by the San Francisco PostgreSQL User Group and part of the bigger FOSS4G conference. The event was very well attended, and we were excited by the content of several presentations.

  • Bruce Momjian offered a very interesting talk on the PostgreSQL planner. He summarized the three decisions a planner makes—scan method, join order, and join method—and then gave an example of each.
  • During the Lightning Talks, David Haynes from the University of Minnesota discussed a project that involves collecting census, agriculture, and climate data from most governments in the world. To optimize compression, the team is putting the data into a PostgreSQL columnar store, which happens to be cstore_fdw.
  • PostGIS committer Paul Ramsey, who works for CartoDB, delivered a very well-attended talk on basic and more advanced ways to use PostGIS/PostgreSQL to process spatial data, build infrastructures, and more.

We also gave a well-attended presentation, "pg_shard: Shard and scale out PostgreSQL", that generated a lot of interest. In our talk, we showed how the concept of logical sharding helps with dynamically scaling out a cluster. We also talked about how pg_shard leverages PostgreSQL's extension APIs to be fully compatible with the database. Finally, we ended the talk with a demo based on one of the largest European retailers, which uses CitusDB to power its dashboards. This retailer has many delivery trucks sending GPS data, and the demo showed how they use CitusDB to visualize real-time heatmaps overlaid on a map of Europe. After the demo, we took questions from the audience, and noted that we've seen this heatmap easily scale to 35M vehicles in a similar use case.
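For readers new to logical sharding, the Python sketch below illustrates the general concept. It does not reproduce pg_shard's actual hash function or placement logic: the shard count, the CRC32 hash, and the function names are all assumptions for the example. The idea is that rows hash to a fixed set of logical shards, which outnumber the physical nodes, so scaling out means moving whole shards to a new node rather than re-hashing every row.

```python
import zlib
from collections import Counter

SHARD_COUNT = 16  # fixed number of logical shards (illustrative value)


def shard_for(key):
    """Map a row key to a logical shard; this never changes as nodes are added."""
    return zlib.crc32(key.encode()) % SHARD_COUNT


def place(shard_count, nodes):
    """Initial placement: spread logical shards round-robin over the nodes."""
    return {shard: nodes[shard % len(nodes)] for shard in range(shard_count)}


def add_node(placement, new_node):
    """Rebalance by moving whole shards from the busiest nodes to the new one."""
    node_count = len(set(placement.values())) + 1
    target = len(placement) // node_count    # shards the new node should hold
    load = Counter(placement.values())
    new_placement = dict(placement)
    moved = 0
    for shard in sorted(new_placement):
        if moved >= target:
            break
        owner = new_placement[shard]
        if load[owner] > target:             # only take from overloaded nodes
            new_placement[shard] = new_node
            load[owner] -= 1
            moved += 1
    return new_placement
```

Growing from two nodes to three with `add_node(place(SHARD_COUNT, ["node-a", "node-b"]), "node-c")` relocates only the shards handed to the new node, and `shard_for` returns the same shard for every key before and after.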


About

CitusDB is a scalable analytics database that's built on top of PostgreSQL.

In this blog, we share our ideas and experiences on databases and distributed systems.
