Cloudflare makes sites lightning fast, protects them from attacks, ensures they are always online, and makes it simple to add web apps with a single click. More than 5 percent of global web requests flow through Cloudflare’s network, and the company handles more traffic than Amazon, Wikipedia, Twitter, Apple and Bing, combined. More than 4 million websites enjoy significant performance gains and protection from cyberattacks.
In 2013, with data traffic growing in excess of 400 percent annually, the company’s original log processing and analytics pipeline, built using Perl scripts, C++, and a single PostgreSQL instance, had reached its limits. Cloudflare’s small analytics team needed an easy-to-operate solution that would not require significant development effort. It also needed to be able to return typical customer queries in less than a second, even when simultaneously querying data from many customer sites over long time windows. The company trusted and wanted to keep using PostgreSQL, but sharding it would have been difficult and taken significant development time.
"Fortunately, Citus maintains compatibility with PostgreSQL instead of forking the code base like other vendors,"" said Albert Strasheim, systems engineer at Cloudflare. "This allowed us to leverage our existing work and expertise, including several Postgres extensions. Because Citus enables both real-time data ingest and sub-second queries across billions of rows, it quickly became the backbone of our customer analytics solution.""
Thanks to Citus, Cloudflare analytics now provides website owners with detailed information on hundreds of millions of site visitors and potential threats, including requests, country of origin, IP address, type of traffic, and much more. Cloudflare also gained cross-customer visibility, aggregating non-user identifying data from millions of customers and performing complex queries with response times measured in seconds to a few minutes.
"Because Citus enables both real-time data ingest and sub-second queries across billions of rows, it quickly became the backbone of our customer analytics solution"Albert Strasheim, Systems Engineer, Cloudflare.
Cloudflare and Citus
Faster Time to Value
Because Citus is code compatible with PostgreSQL and worked with Cloudflare’s existing Postgres extensions, implementing Citus was simple. "If you can run a single Postgres database, you can run a Citus cluster," said Strasheim. "Other options would probably have required five people and a year of development time. We were up and running in less than three months with the equivalent of two and a half people."
"We love the Citus architecture because it enabled us to continue using Postgres," added Strasheim. "Our analytics dashboard and business intelligence tools connect to Citus using standard Postgres connectors, and tools like pg_dump and pg_upgrade work great. Two features that stand out for us are Citus’s Postgres extensions that power our analytics dashboards, and Citus’s ability to parallelize the logic in those extensions out of the box."
The Cloudflare cluster runs on commodity servers. Increasing the capacity of the cluster is as simple as adding additional commodity servers and rebalancing the shards across the cluster. The cluster houses 100 TB of data using approximately 1 million shards, with each shard replicated to multiple machines. Queries against the cluster take between 25 milliseconds and 2 seconds, depending on whether some or all of the data is available in page cache.
Citus utilizes one master node to hold authoritative metadata about the shards in the cluster and parallelize incoming queries. Worker nodes then run the queries. When the application sends a query to the cluster, the master node finds the relevant shards, transforms the query into many smaller queries for parallel execution, and ships those smaller queries to the worker nodes. The master node receives intermediate results from the workers, merges them, and returns the final results. This ability to distribute queries across commodity servers without sacrificing performance provides Cloudflare with very cost-effective long-term scalability.
Initially, Cloudflare evaluated what it would take to build an application that sharded data on top of PostgreSQL using the postgres_fdw extension to provide a unified view of multiple independent PostgreSQL servers. This solution, however, would not tolerate down servers. By contrast, Citus enables Cloudflare to replicate multiple logical shards to independent machines in the cluster and automatically fail over between replicas even during queries. In the event of a hardware failure, Cloudflare can use the Citus rebalance function to re-replicate shards in the cluster.
Real-Time, Granular, Cross-Customer Visibility
The ability to use Citus for business intelligence is helping Cloudflare grow its business more cost effectively. "The ease with which we can run distributed queries on the data allows our product, operations and sales teams to quickly get the answers they need to improve performance, enhance the customer experience and increase sales," said Strasheim. "This instant cross-customer visibility using a single SQL query has eliminated the need to spend hours digging through granular data, enabling our teams to spend significantly more time creating value for our business."