Analyzing GPS trajectories at scale with Postgres, MobilityDB, & Citus

Written by Mahmoud Sakr
November 9, 2020

This post by guest blogger Mahmoud Sakr about analyzing geospatial movement trajectories was originally published on the Azure Database for PostgreSQL Blog on Microsoft TechCommunity.

GPS has become part of our daily life. GPS is in cars for navigation, in smartphones helping us to find places, and more recently GPS has been helping us to avoid getting infected by COVID-19. Managing and analyzing mobility tracks is the core of my work. My group in Université libre de Bruxelles specializes in mobility data management. We build an open source database system for spatiotemporal trajectories, called MobilityDB. MobilityDB adds support for temporal and spatiotemporal objects to the Postgres database and its spatial extension, PostGIS. If you're not yet familiar with spatiotemporal trajectories, not to worry, we'll walk through some movement trajectories for a public transport bus in just a bit.

One of my team's projects is to develop a distributed version of MobilityDB. This is where we came in touch with the Citus extension to Postgres and the Citus engineering team. This post presents issues and solutions for distributed query processing of movement trajectory data. GPS is the most common source of trajectory data, but the ideas in this post also apply to movement trajectories collected by other location tracking sensors, such as radar systems for aircraft, and AIS systems for sea vessels.

As a start, let's explore the main concepts of trajectory data management, so you can see how to analyze geospatial movement trajectories.

The following animated gif shows a geospatial trajectory of a public transport bus1 that goes nearby an advertising billboard. What if you wanted to assess the visibility of the billboard to the passengers in the bus? If you can do this for all billboards and vehicles, then you would be able to extract interesting insights for advertising agencies to price the billboards, and for advertisers who are looking to optimize their campaigns.

public transport bus trajectory
Throughout this post, I'll use maps to visualize bus trajectories and advertising billboards in Brussels, so you can learn how to query where (and for how long) the advertising billboards are visible to the bus passengers. The background map is courtesy of OpenStreetMap.

In the animated gif above, we simply assume that if the bus is within 30 meters to the billboard, then it is visible to its passengers. This "visibility" is indicated in the animation by the yellow flash around the billboard when the bus is within 30 meters of the billboard.

How to measure the billboard's visibility to a moving bus using a database query?

Let's prepare a toy PostGIS database that minimally represents the example in the previous animated gif—and then gradually develop an SQL query to assess the billboard visibility to the passengers in a moving bus.

If you are not familiar with PostGIS, it is probably the most popular extension to Postgres and is used for storing and querying spatial data. For the sake of this post, all you need to know is that PostGIS extends Postgres with data types for geometry point, line, and polygon. PostGIS also defines functions to measure the distance between geographic features and to test topological relationships such as intersection.

In the SQL code block below, first you create the PostGIS extension. And then you will create two tables: gpsPoint and billboard.

CREATE TABLE gpsPoint (tripID int, pointID int, t timestamp, geom geometry(Point, 3812));
CREATE TABLE billboard(billboardID int, geom geometry(Point, 3812));

INSERT INTO gpsPoint Values
(1, 1, '2020-04-21 08:37:27', 'SRID=3812;POINT(651096.993815166 667028.114604598)'),
(1, 2, '2020-04-21 08:37:39', 'SRID=3812;POINT(651080.424535144 667123.352304597)'),
(1, 3, '2020-04-21 08:38:06', 'SRID=3812;POINT(651067.607438095 667173.570340437)'),
(1, 4, '2020-04-21 08:38:31', 'SRID=3812;POINT(651052.741845233 667213.026797244)'),
(1, 5, '2020-04-21 08:38:49', 'SRID=3812;POINT(651029.676773636 667255.556944161)'),
(1, 6, '2020-04-21 08:39:08', 'SRID=3812;POINT(651018.401101238 667271.441380755)'),
(2, 1, '2020-04-21 08:39:29', 'SRID=3812;POINT(651262.17004873  667119.331513367)'),
(2, 2, '2020-04-21 08:38:36', 'SRID=3812;POINT(651201.431447782 667089.682115196)'),
(2, 3, '2020-04-21 08:38:43', 'SRID=3812;POINT(651186.853162155 667091.138189286)'),
(2, 4, '2020-04-21 08:38:49', 'SRID=3812;POINT(651181.995412783 667077.531372716)'),
(2, 5, '2020-04-21 08:38:56', 'SRID=3812;POINT(651101.820139904 667041.076539663)');

INSERT INTO billboard Values
(1, 'SRID=3812;POINT(651066.289442793 667213.589577551)'),
(2, 'SRID=3812;POINT(651110.505092035 667166.698041233)');

The database is visualized in the map below. You can see that the gpsPoint table has points of two bus trips, trip 1 in blue and trip 2 in red. In the table, each point has a timestamp. The two billboards are the gray diamonds in the map.

bus trips gpsPoint map

Your next step is to find the locations where a bus is within 30 meters from a billboard—and also the durations, i.e. how long the moving bus is within 30 meters of the billboard.

SELECT tripID, pointID, billboardID
FROM gpsPoint a, billboard b
WHERE st_dwithin(a.geom, b.geom, 30);

--1    4    1

This PostGIS query above does not solve the problem. Yes, the condition in the WHERE clause finds the GPS points that are within 30 meters from a billboard. But the PostGIS query does not tell the duration of this event.

Furthermore, imagine that point 4 in trip 1 (the blue trip) was not given. This query would have then returned null. The problem with this query is that it does not deal with the continuity of the bus trip, i.e. the query does not deal with the movement trajectory of the bus.

We need to reconstruct a continuous movement trajectory out of the given GPS points. Below is another PostGIS query that would find both the locations of the billboard's visibility to the passengers in the bus, and also the duration of how long the billboard was visible to the bus passengers.

WITH pointPair AS(
  SELECT tripID, pointID AS p1, t AS t1, geom AS geom1,
    lead(pointID, 1) OVER (PARTITION BY tripID ORDER BY pointID) p2,
    lead(t, 1) OVER (PARTITION BY tripID ORDER BY pointID) t2,
    lead(geom, 1) OVER (PARTITION BY tripID ORDER BY pointID) geom2
  FROM gpsPoint
), segment AS(
  SELECT tripID, p1, p2, t1, t2,
    st_makeline(geom1, geom2) geom
  FROM pointPair
), approach AS(
  SELECT tripID, p1, p2, t1, t2, a.geom,
    st_intersection(a.geom, st_exteriorRing(st_buffer(b.geom, 30))) visibilityTogglePoint
  FROM segment a, billboard b
  WHERE st_dwithin(a.geom, b.geom, 30)
SELECT tripID, p1, p2, t1, t2, geom, visibilityTogglePoint,
  (st_lineLocatePoint(geom, visibilityTogglePoint) * (t2 - t1)) + t1 visibilityToggleTime
FROM approach;

Yes, the above PostGIS query is a rather complex one. We split the query into multiple common table expressions CTEs, to make it readable. In Postgres, CTEs give you the ability to "name" a subquery, to make it easier to write SQL queries consisting of multiple steps.

  • The first CTE, pointPair in Lines 1-7, uses the window function lead, in order to pack every pair of consecutive points, that belong to the same bus trip, into one tuple.

  • This is a preparation for the second CTE, segment in Lines 7-12, which then connects the two points with a line segment. This step can be seen as a linear interpolation of the path between every two GPS points.

The result of these two CTEs can be visualized in the map below:

CTEs results map

Then the third CTE, approach Lines 12-18, finds the locations where the bus starts/ends to be within 30 meters from the billboard. This is done by drawing a circular ring with 30 meters diameter around the billboard, and intersecting it with the segments of the bus trajectory. We thus get the two points in the map below, marked with the black cross.

bus start/end map

The last step in the earlier PostGIS query, Lines 19-22, computes the time at these two points using linear referencing, that is assuming a constant speed per segment2.

bus time reference map

Exercise: try to find a simpler way to express the PostGIS query displayed earlier. I couldn't. :-)

The PostGIS query had to be that complex, because it programs two non-trivial concepts:

  • Continuous movement trajectory: while the GPS data is discrete, we had to reconstruct the continuous movement trajectory.
  • Spatiotemporal proximity: the continuous movement trajectory was used to find the location and time (i.e., spatiotemporal) during which the bus was within 30 meters from the billboard.

The good news for you is that MobilityDB can help make it easier to analyze these types of movement trajectories. MobilityDB is an extension of PostgreSQL and PostGIS that has implemented these spatiotemporal concepts as custom types and functions in Postgres.

Let's have a look on how to express this PostGIS query more simply using MobilityDB.

MobilityDB: a moving object database system for Postgres & PostGIS

Here is how the previous PostGIS query would be expressed in MobilityDB.

SELECT astext(atperiodset(trip, getTime(atValue(tdwithin(a.trip, b.geom, 30), TRUE))))
FROM busTrip a, billboard b
WHERE dwithin(a.trip, b.geom, 30)

--{[POINT(651063.737915354 667183.840879818)@2020-04-21 08:38:12.507515+02,
    POINT(651052.741845233 667213.026797244)@2020-04-21 08:38:31+02,
    POINT(651042.581085347 667231.762425657)@2020-04-21 08:38:38.929465+02]}

What you need to know about the MobilityDB query above:

  • The table busTrip has the attribute trip of type tgeompoint. It is the MobilityDB type for storing a complete trajectory.
  • The nesting of tdwithin->atValue->getTime will return the time periods during which a bus trip has been within a distance of 30 meters to a billboard.
  • The function atperiodset will restrict the bus trip to only these time periods.
  • The function astext converts the coordinates in the output to textual format.
  • Accordingly, the result shows the part of the bus trip that starts at 2020-04-21 08:38:12.507515+02 and ends at 08:38:38.929465+02.

The MobilityDB documentation describes all of MobilityDB's operations.

Now we step back, and show the creation of the busTrip table.


CREATE TABLE busTrip(tripID, trip) AS
  SELECT tripID,tgeompointseq(array_agg(tgeompointinst(geom, t) ORDER BY t))
FROM gpsPoint

--Query returned successfully in 78 msec.

SELECT tripID, astext(trip) FROM busTrip;

1    "[POINT(651096.993815166 667028.114604598)@2020-04-21 08:37:27+02,
       POINT(651080.424535144 667123.352304597)@2020-04-21 08:37:39+02,
       POINT(651067.607438095 667173.570340437)@2020-04-21 08:38:06+02,
       POINT(651052.741845233 667213.026797244)@2020-04-21 08:38:31+02,
       POINT(651029.676773636 667255.556944161)@2020-04-21 08:38:49+02,
       POINT(651018.401101238 667271.441380755)@2020-04-21 08:39:08+02]"
2    "[POINT(651201.431447782 667089.682115196)@2020-04-21 08:38:36+02,
       POINT(651186.853162155 667091.138189286)@2020-04-21 08:38:43+02,
       POINT(651181.995412783 667077.531372716)@2020-04-21 08:38:49+02,
       POINT(651101.820139904 667041.076539663)@2020-04-21 08:38:56+02,
       POINT(651262.17004873  667119.331513367)@2020-04-21 08:39:29+02]"
  • The first step above is to create the MobilityDB extension in the database. In Postgres, the CASCADE option results in executing the same statement on all the dependencies. In the query above—because PostGIS is a dependency of MobilityDB—CASCADE will also create the PostGIS extension, if it has not yet been created.

  • The second query above creates the busTrip table with two attributes (tripID int, trip tgeompoint). tgeompoint is the MobilityDB type to represent a movement trajectory. The tgeompoint attribute is constructed from a temporally sorted array of instants, each of which is a pair of a spatial point and a timestamp. This construction is expressed in the query above by the nesting of tgeompointinst -> array_agg -> tgeompointseq.

  • The last SELECT query above shows that the busTrip table contains two tuples, corresponding to the two trips. Every trip has the format [point1@time1, point2@time2, ...].

Bigger than an elephant: how to query movement trajectories at scale, when a single Postgres node won't do

As we now have two working solutions for measuring the billboard visibility: one in PostGIS and another one in MobilityDB, the next natural move is to apply these solutions to a big database of all bus trips in Brussels in the last year, and all billboards in Brussels. This amounts to roughly 5 million bus trips (roughly 5 billion GPS points) and a few thousand billboards. This size goes beyond what a single Postgres node can handle. Hence, we need to distribute the Postgres database.

This is a job for Citus, the extension to Postgres that transforms Postgres into a distributed database. Efficiently distributing the complex PostGIS query with many CTEs is a challenge we'll leave to the Citus engineering team.

What I want to discuss here is the distribution of the MobilityDB query. Citus does not know the types and operations of MobilityDB. So the distribution is limited by what Citus can do in general for custom types and functions. My colleague, Mohamed Bakli, has done this assessment and published it in a paper titled "Distributed moving object data management in MobilityDB" in the ACM BigSpatial workshop (preprint) and in a demo paper titled "Distributed Mobility Data Management in MobilityDB" in the IEEE MDM conference (preprint).

The papers presented a solution to distribute MobilityDB using Citus. All the nodes in the Citus database cluster had PostgreSQL, PostGIS, MobilityDB, and Citus installed. The goal was to assess to what extent the spatiotemporal functions in MobilityDB can be distributed.

To do this assessment, the BerlinMOD benchmark (a tool for comparing moving object databases) was used. BerlinMOD consists of a trajectory data generator, and 17 benchmark queries that assess the features of a moving object database system. It was possible to execute 13 of the 17 BerlinMOD benchmark queries on a MobilityDB database cluster that is managed by Citus, without special customization.

See also the illuminating blog post about using custom types with Citus & Postgres, by Nils Dijk.

Back to our MobilityDB billboard visibility query, our mission is to calculate the billboard visibility for all billboards and all common transport vehicles in Brussels for an entire year.

We had set up a Citus database cluster, and created the MobilityDB extension in all its nodes. Then we used the Citus create_distributed_table function to distribute the busTrip table across all the worker nodes in the Citus database cluster. Next we made the billboard table a Citus reference table, and copied the reference table to all the worker nodes.

Here is the resulting distributed query plan:

SELECT atperiodset(trip, getTime(atValue(tdwithin(a.trip, b.geom, 30), TRUE)))
FROM busTrip a, billboard b
WHERE dwithin(a.trip, b.geom, 30);

Query plan
Custom Scan (Citus Adaptive)  (cost=0.00..0.00 rows=100000 width=32)
  Task Count: 32
  Tasks Shown: One of 32
  ->  Task
      Node: host= port=5432 dbname=roma
      ->  Nested Loop  (cost=0.14..41.75 rows=1 width=32)
          ->  Seq Scan on provinces_dist_102840 b (cost=0.00..7.15 rows=15 width=32)
          ->  Index Scan using spgist_bustrip_idx_102808 on bustrip_hash_tripid_102808 a
              (cost=0.14..2.30 rows=1 width=32)
              Index Cond: (trip && st_expand(b.geom, '30'::double precision))
              Filter: _dwithin(trip, b.geom, '30'::double precision)

The Citus distributed query executor parallelizes the query over all workers in the Citus cluster. Every node also has the MobilityDB extension, which means we can use MobilityDB functions such as dwithin in the query and in the indexes. Here for example, we see that the SP-GiST index on the Citus worker is used to efficiently evaluate the WHERE dwithin(...) clause.

With this, we come to the end of this post. To sum up, this post has two main takeaways:

If you're ever looking to analyze movement trajectories to understand the spatiotemporal interaction of things across space and time, you now have a few new (open source!) options in your Postgres and PostGIS toolbox:

  • MobilityDB can help you to manage and analyze geospatial (e.g. GPS, radar) movement trajectories in PostgreSQL.

  • MobilityDB + Citus open source work together out of the box, so you can analyze geospatial movement trajectories at scale, too. Just add the two Postgres extensions (along with PostGIS) into your Postgres database, and you are ready to manage big geospatial trajectory datasets.


  1. Curious about the source of this data? The trajectory is for line 71 in Brussels, when it goes in front of my university campus ULB Solbosch. The public transport company in Brussels publishes an open API, where all the trajectories of their vehicles can be probed The billboard location was invented by me, and the background map comes from Openstreetmap.
  2. It remains to compute the visibility duration, i.e., the difference in seconds between the two timestamps, which can be done by another CTE and window functions. Not to further complicate the query, we skip this detail here.
Mahmoud Sakr

Written by Mahmoud Sakr

Computer science professor at Université libre de Bruxelles. Co-founder of the MobilityDB extension of Postgres.