Distributing Postgres with a Distributed Team at Citus Data

Over the last two years, our engineering team at Citus Data has shortened release cycles from 12 months all the way down to 8 weeks. The most recent 7.2 release of the Citus database took 8 weeks exactly, start to finish.

These shortened release cycles have been chock full of new capabilities for our users, including distributed deadlock detection in Citus 7.0, multi-shard updates and deletes in Citus 7.1, and support for CTEs (common table expressions) and complex Postgres subqueries in Citus 7.2.

On the Citus Cloud side (that’s our fully managed database as a service that runs on AWS), we’ve recently added fork, followers, fully online "warp" migration from existing PostgreSQL installations, and point-in-time recovery (PITR), just to name a few.

When I step back to think about how we got here (as a co-founder of Citus Data, I’ve been here since the beginning), it’s no surprise that I attribute much of what we’ve accomplished to our team. But here’s the point about all these accomplishments that I think is so interesting: our engineering team is distributed across 5 countries and 6 different cities.

The distributed nature of the Citus engineering team is a key part of our success. While I don’t want to suggest that our engineering learnings at Citus Data are cardinal rules for distributed development for everyone else, I think it’s worth sharing our learnings here.

Why Distributed Development? Because good developers are everywhere

Building a distributed database is hard, and requires skills in both breadth and depth. Developers need to be able to reason about distributed systems, consistency, and failure scenarios, all while understanding PostgreSQL internals like the storage subsystem, the planner, and the executor. While the San Francisco Bay Area has a rich talent pool, competition there is stiff, and we wanted to opportunistically hire well-referenced software engineers anywhere we could find them.

The PostgreSQL development community is also distributed, so recruiting Postgres experts demands being open to hiring across different geographies. The founding team at Citus Data also spent early formative days in Turkey, so it just made sense to take advantage of the strong local engineering talent in Turkey.

Growing Pains

When we were a small early-stage Y Combinator startup, we as founders continued to be part of all feature development, from design discussions and sign-offs to code reviews.

As we grew the team across geographies, it became untenable to gate all our development efforts through the founders. Responses took a minimum of 12 hours, and more as the team grew—which meant that the turnaround time on decisions was a minimum of 24 hours and often longer if a back & forth discussion was needed. Engineers quickly started losing productivity, and our development started incurring delays.

We had to decentralize the way we performed code and design reviews, and provide more autonomy for the engineering teams in both San Francisco and Istanbul (our first two primary engineering hubs).

Apart from code reviews, we also faced challenges in project management and task assignment. Distributed engineers were gated on us (the founders) for priorities and next tasks. If engineers were blocked on a task, they again had to wait out the time zone difference to hear from us. Our releases started to slip, and release dates were way too unpredictable.

We realized we had to put the right talent, processes, and tooling in place if we were to make our distributed development team a success.

Recruit a Senior Engineer in Each Timezone

Recruit at least one Principal/Senior Engineer in each timezone. Make sure that person has an interest in important leadership skills like mentorship and communication. The technical & architectural skills of our senior hires were key to growing the technical capabilities of the engineering team. And the communication skills were key to operating effectively as a distributed team.

Of course, hiring in different geographies can be challenging. Serendipity helped. One of us had a former colleague who had just moved to Europe and the timing was perfect.

The result of hiring a senior developer in each of our core geographies:

Engineers had someone they could get help from in the same time zone.
Turnaround times for decisions decreased dramatically.
We were able to bring junior engineers up to speed more quickly.
We were able to grow our junior engineers into more senior roles more quickly.
By avoiding delays & inefficiencies, we were able to grow the team more quickly.
Overall, the engineering team became much more productive, by orders of magnitude.

Find Someone Skilled at Project Management, Locally

Recruit or designate a person for local project management. Having someone in the local timezone who is fully aware of all the priorities and dependencies helped to unblock engineers on task assignments, code reviews, and other technical resourcing questions. This helped efficiency within the team a lot, and also made sure we were always working on the highest priority items. The autonomy also helps when there are critical bugs: the team can make the call to shuffle tasks around and prioritize bug fixes without waiting for explicit approval.

Create a Travel Calendar

There is no substitute for face to face interaction, whether to build friendships or to collaborate on a complex project. Projects which had some in-person discussion or collaboration were much more successful. As our team grew, we also realized there were people who had never met each other. We started to consider a travel calendar, so we could formalize travel and put together a plan for engineers to travel and meet.

Annual Company Meeting—Critical for Team Building

Once a year the entire team gets together in one location for a full-team offsite. We are usually all together for at least a week. We use this all-company time together to coordinate on goals, share information, and just hang out together, forging social bonds that inevitably help us when geography separates us again. It’s also a chance for new people to meet everyone else on the team and vice versa.

During the all-company meeting, we also hold a developer specific session, where we discuss issues (if any, but there are always some to discuss!) that have been impacting our developers. We make sure that we always come out of the developer session with specific written action items that can be tracked and followed up on.

The Citus Data Hack-Day—Fun, Creativity, and a Good Way to Bond

We also hold a hackathon, which helps pair up engineers from different geographies. Building these social bonds is important, as it helps transcend the otherwise cold and formal communication of a code or design review.

Regular Face-to-Face Visits to Build Understanding

In addition to the annual all-company meeting, we try to get every engineer who is not based in our headquarters to visit our San Francisco HQ in SOMA at least once more during the year. We try to coordinate this visit with a project where they might need to collaborate with someone here in SF, where the high bandwidth communication is critical for the right design or approach.

In addition, we also typically have our distributed engineers shadow our sales/sales-engineering teams while they are in SF, to give them more context into our customers and open source users, and some of the problems they face.

Weekly Team Call, at a Time-of-Day That Works For All

We conduct a once-a-week team-wide meeting, which has a regular section of rotating functional updates. This team call includes everyone from all 7 Citus cities. Our weekly call is also cross-functional, with team members from engineering, documentation, marketing, & sales.

Meetings can get a bad rap. But when done right, a meeting helps to align everyone in the team, and also gives them visibility into different areas in the company. Our weekly sync is sometimes followed by a deep-dive into a specific project or area, where engineers can demo and describe new functionality. Smaller groups of people have also found it advantageous to hold a ‘virtual meeting’, where people all get together either on Slack or on Hangouts and use that as an open forum to chat about projects or other blocking issues. We use all the various forms of technology to enable asynchronous & electronic communication, but sometimes nothing beats talking to each other.

Tools

Slack

It’s no surprise that we end up using Slack pretty heavily. We have various channels where developers can chat about code, projects, etc. We also have a channel set up for communication between developers and sales-engineering, so feature-specific clarification questions have a forum in which to be asked.

Slack enables both synchronous as well as asynchronous communication, allowing real-time conversations in a geography, and people from another timezone to catch up on the thread the next day. It also allows for distributed social conversations, and we have channels like #pets, where everyone can share pictures and stories of their favorite furball!

My dog from the Citus pets channel

We recently also created a #gong channel, where we celebrate successes, big or small. Examples include winning a customer, a new hire joining, launching new web pages, as well as deploying new features. The #gong channel is a great way to virtualize the gongs that teams historically used to celebrate successes when they were co-located, and gives our distributed team a way to highlight all the positive news within the company.

Google Drive

We use Google Drive for most of our internal documentation and artefacts. We have folders specific to functions (like sales, marketing, dev, etc.), and it acts as a good searchable store of our company information. In the dev-specific folder we store design documents, benchmarking results, testing results, etc. The integration with Google Docs and its security features provide us with enough functionality to communicate and coordinate.

Video Calling

We have set up dedicated video-calling machines in both Istanbul and San Francisco. In SF we use a Chromebox, hooked up to a large TV screen and with a dedicated mic and camera. This works much better than a laptop microphone/camera combination for meetings involving multiple people.

For video conferencing, we currently use a mix of Google Hangouts, Zoom, and GoToMeeting. Unfortunately, we haven’t yet found a more reliable tool for video calling, and we tend to have semi-recurring issues with each of them. Apart from just call quality, we’ve had issues with the software recognizing mics, volume being too low or high, or echo and feedback that make the software unusable. This gets especially problematic in larger meetings, where people on the call find it hard to follow the meeting and keep track of who’s speaking.

Distributed development has made our Citus database engineering team stronger

There is no silver bullet for distributed development.

Recommendations based on what has worked for our team may not work for you. To claim that we’ve solved the barriers posed by distributed development would be arrogant at best. However, our experience is that we have been able to grow a distributed team faster—and along with some of our best practices, we’ve found a way to be efficient. And our distributed engineering team iterates fast.

As we grow our company and our distributed team, I’m sure we will continue to face new challenges. And we will rise to the occasion. Who knows, maybe there is another blog post (a part II) on managing distributed database development in my future.

Written by Sumedh Pathak

Former principal engineer on the Postgres team at Microsoft. Co-founder & VP of Engineering at Citus Data. Speaker at QCon London & DataEngConf SF. M.S. Computer Science Stanford. Family. Tennis ball. Dog.