POSETTE 2024 is a wrap! 💯 Thanks for joining the fun! Missed it? Watch all 42 talks online 🍿
POSETTE 2024 is a wrap! 💯 Thanks for joining the fun! Missed it? Watch all 42 talks online 🍿
Written by Joe Nelson
March 30, 2016
It may surprise you that pagination, pervasive as it is in web applications, is easy to implement inefficiently. In this article we'll examine several methods of server-side pagination and discuss their tradeoffs when implemented in PostgreSQL. This article will help you identify which technique is appropriate for your situation, including some you may not have seen before which rely on physical clustering and the database stats collector.
Before continuing it makes sense to mention client-side pagination. Some applications transfer all (or a large part) of the server information to the client and paginate there. For small amounts of data client-side pagination can be a better choice, reducing HTTP calls. It gets impractical when records begin numbering in the thousands. Server-side has additional benefits such as
PostgreSQL gives us a number of server-side pagination techniques that differ in speed, integrity (not missing records), and support for certain page access patterns. Not all methods work in all situations, some require special data or queries. Let’s consider the methods in order of generality, starting with those that work for any query, then those which require ordered data. We’ll conclude with some exotic methods which rely on PostgreSQL internals.
The easiest method of pagination, limit-offset, is also most perilous. Sadly it's a staple of web application development tutorials. Object relational mapping (ORM) libraries make it easy and tempting, from SQLAlchemy's .slice(1, 3) to ActiveRecord's .limit(1).offset(3) to Sequelize's .findAll({ offset: 3, limit: 1 }). They all generate SQL ending in LIMIT 1 OFFSET 3. It's no coincidence that limit-offset use is widespread, you can tack it onto any query without further modification.
ORM methods to limit and offset the data are one thing, but pagination helper libraries can be even more deceptive. For instance the popular Ruby library Kaminari uses limit-offset by default, while hiding it behind a high-level interface.
The technique has two big problems, result inconsistency and offset inefficiency. Consistency refers to the intention that traversing a resultset should retrieve every item exactly once, without omissions or duplication. Offset inefficiency refers to the delay incurred by shifting the results by a large offset.
Here's how limit-offset pagination can be inconsistent. Suppose a user moves from page n to n+1 while simultaneously a new element is inserted into page n. This will cause both a duplication (the previously-final element of page n is pushed into page n+1) and an omission (the new element). Alternatively consider an element removed from page n just as the user moves to page n+1. The previously initial element of page n+1 will be shifted to page n and be omitted.
Now for the inefficiency. Large offsets are intrinsically expensive. Even in the presence of an index the database must scan through storage, counting rows. To utilize an index we would have to filter a column by a value, but in this case we require a certain number of rows irrespective of their column values. Furthermore the rows needn't have the same size in storage, and some may be present on disk but marked as deleted so the database cannot use simple arithmetic to find a location on disk to begin reading results. Let's measure the slowdown.
-- Create table with random strings of various lengths
CREATE TABLE medley AS
SELECT
generate_series(1,10000000) AS n,
substr(concat(md5(random()::text), md5(random()::text)), 1, (random() * 64)::integer + 1) AS description;
-- Notify query planner of drastically changed table size
VACUUM ANALYZE;
-- Low offsets are refreshingly fast
EXPLAIN ANALYZE SELECT * FROM medley LIMIT 100;
The estimated cost is quite low:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..1.85 rows=100 width=38) (actual time=0.008..0.036 rows=100 loops=1)
-> Seq Scan on medley (cost=0.00..185460.60 rows=9999660 width=38) (actual time=0.007..0.017 rows=100 loops=1)
Planning time: 0.040 ms
Execution time: 0.059 ms
(4 rows)
Choosing offset=1000 makes cost about 19 and has a 0.609 ms execution time. Once offset=5,000,000 the cost goes up to 92734 and execution time is 758.484 ms.
These problems don't necessarily mean that limit-offset is inapplicable for your situation. In some applications users don't typically advance many pages into a resultset, and you might even choose to enforce a server page limit. If result inconsistency and restricted page numbers aren't a problem in your application then limit-offset may be convenient for your needs.
When to Use: Limit-offset
Applications with restricted pagination depth and tolerant of result inconsistencies.
Despite its disadvantages limit-offset does have the advantage of being stateless on the server. Contrast it with another pagination approach, query cursors. Like offsets, cursors can be used in any query, but they differ by requiring the server to hold a dedicated database connection and transaction per HTTP client.
Here is how cursors can be used:
-- We must be in a transaction
BEGIN;
-- Open a cursor for a query
DECLARE medley_cur CURSOR FOR SELECT * FROM medley;
-- Retrieve ten rows
FETCH 10 FROM medley_cur;
-- ...
-- Retrieve ten more from where we left off
FETCH 10 FROM medley_cur;
-- All done
COMMIT;
Cursors have the desirable property of pagination consistency on arbitrary queries, showing results as they exist at the time the transaction was started. The isolation level (link is external) of the transaction guarantees that our paginated view of the results won’t change.
Every pagination approach has a downside, and the problems with cursors are resource usage and client-server coupling. Each open transaction consumes dedicated database resources, and is not scalable for too many clients. There are also "WITH HOLD" cursors which can exist outside of a transaction, but they must materialize data. Either way this makes cursor pagination appropriate only for small scale situations like intranet use.
Bridging HTTP to cursors introduces complications. Servers must identify clients across requests, either through a token or by keeping an identifier such as the client IP address in a session. Servers must also judge when to release transactions due to inactivity. Finally server load balancing becomes complicated, since each client must connect to a dedicated server each time.
When to Use: Cursors A single-server intranet application which must paginate queries with varied and changeable ordering, especially where result consistency matters.
The techniques above can paginate any kind of query, including queries without order clauses. If we are willing to forgo this generality we reap optimizations. In particular when ordering by indexed column(s) the client can use values in the current page to choose which items to show in the next page. This is called keyset pagination.
For example let's return to the medley example:
-- Add an index for keyset pagination (btrees support inequality)
CREATE INDEX n_idx ON medley USING btree (n);
SELECT * FROM medley ORDER BY n ASC LIMIT 5;
With my randomized data it returns
n | description
---+-------------------------------------------------------------
1 | 74f70e009396
2 | 8dac5a085eb670a29058d
3 | fce303a32e89181bf5df1601487
4 | fddcced2c12e83516b3bd6cc94f23a012dfd
5 | f51ae548dd27f51147e53e839eeceb6b0c92922145276d668e73d4a6621
(5 rows)
Now the client can look at the maximal n in this result and use it for requesting the next page:
SELECT *
FROM medley
WHERE n > 5
ORDER BY n ASC
LIMIT 5;
Even filtering by n > 5000000 remains fast, unlike the limit-offset example.
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..0.62 rows=5 width=38) (actual time=0.101..0.103 rows=5 loops=1)
-> Index Scan using n_idx on medley (cost=0.43..185579.42 rows=5013485 width=38) (actual time=0.100..0.102 rows=5 loops=1)
Index Cond: (n > 5000000)
Planning time: 0.071 ms
Execution time: 0.119 ms
(5 rows)
Keyset pagination is fast, and it is consistent too. Any insertions/deletions before the current page will leave the results unaffected. The two downsides to this method are lack of random access and possible coupling between client and server.
In general there is no way to jump directly to a given page without visiting prior pages to observe their maximal elements. Under certain conditions we can do better though. If values in the indexed column are uniformly distributed (or even better, contiguous numbers without gaps) the client can do some math to find the desired page because the index makes it cheap to find the largest value:
EXPLAIN ANALYZE SELECT max(n) FROM medley;
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Result (cost=0.46..0.47 rows=1 width=0) (actual time=0.021..0.021 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.43..0.46 rows=1 width=4) (actual time=0.018..0.018 rows=1 loops=1)
-> Index Only Scan Backward using n_idx on medley (cost=0.43..284688.43 rows=10000000 width=4) (actual time=0.017..0.017 rows=1 loops=1)
Index Cond: (n IS NOT NULL)
Heap Fetches: 0
Planning time: 0.087 ms
Execution time: 0.042 ms
(8 rows)
The other issue of keyset pagination, client/server coupling, requires care. First the client doesn't know which columns are indexed. The server will likely need to provide an endpoint with fixed order rather than allowing the client to customize the ordering. Given the client code may not know which column is being ordered, the server must provide a hint for how to request the next page. RFC5988 defines HTTP link relations previous and next to encode links for the client to follow.
Since users typically access pages of information in a linear fashion, keyset pagination is usually considered the best choice for paginating ordered records in high-traffic web servers.
When to Use: Keyset Scalable applications serving data sequentially from column(s) indexed for comparisons. Supports filtering.
We can devise nonstandard pagination techniques for special situations using low level PostgreSQL features. For instance we can implement truly random-access access on data if we
The trick is to choose returned pages which correspond directly with database pages on disk or to sections of those disk pages. Every table in a PostgreSQL database contains a secret column called ctid which identifies its row:
SELECT ctid, * FROM medley WHERE n <= 10;
ctid | n | description
--------+----+-------------------------------------------------------------
(0,1) | 1 | 74f70e009396
(0,2) | 2 | 8dac5a085eb670a29058d
(0,3) | 3 | fce303a32e89181bf5df1601487
(0,4) | 4 | fddcced2c12e83516b3bd6cc94f23a012dfd
(0,5) | 5 | f51ae548dd27f51147e53e839eeceb6b0c92922145276d668e73d4a6621
(0,6) | 6 | eb9fe1dfe1e421903f96b3b5c5dfe1ee1253582d728c35b4ee7330b
(0,7) | 7 | e95202d7f5c612f8523ae705d
(0,8) | 8 | 6573b64aff262a2b940326
(0,9) | 9 | a0a43
(0,10) | 10 | 82cdc134bd249a612cfddd3088dd09e32de5f4fa33
(10 rows)
Each ctid is of the form (page,row). PostgreSQL can retrieve rows very quickly by ctid, in fact this is how indices work internally -- they map column values to ctids.
Note that although PostgreSQL defines an order relation on the tid type, it cannot efficiently retrieve ctids by inequality
EXPLAIN ANALYZE SELECT count(1) FROM medley WHERE ctid >= '(0,1)'::tid AND ctid < '(1,0)'::tid;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Aggregate (cost=235589.00..235589.01 rows=1 width=0) (actual time=1241.851..1241.852 rows=1 loops=1)
-> Seq Scan on medley (cost=0.00..235464.00 rows=50000 width=0) (actual time=477.933..1241.802 rows=116 loops=1)
Filter: ((ctid >= '(0,1)'::tid) AND (ctid < '(1,0)'::tid))
Rows Removed by Filter: 9999884
Planning time: 0.047 ms
Execution time: 1241.889 ms
(6 rows)
Requesting ranges doesn’t work but there is still a way to efficiently request all rows in a disk page. Every page contains currentsetting('blocksize') bytes of data (often 8k). Rows are referenced by a 32-bit pointer so there are at most block_size/4 rows per page. (In fact rows are typically wider than the minimum size and a quarter of the block size provides an upper bound of rows per page.) The following sequence will generate all possible ctids in the jth page
SELECT ('(' || j || ',' || s.i || ')')::tid
FROM generate_series(0,current_setting('block_size')::int/4) AS s(i);
Let's use it to get all rows in medley on page zero.
SELECT * FROM medley WHERE ctid = ANY (ARRAY
(SELECT ('(0,' || s.i || ')')::tid
FROM generate_series(0,current_setting('block_size')::int/4) AS s(i)
)
);
The planner identified this query as having cost=25.03..65.12 and it runs in 2.765ms. Requesting page 10,000 has similar cost. So we're getting true random access, what's not to love?
There are three downsides
In certain situations this is not a problem. One case is data whose natural order corresponds to insertion order such as append-only time-series data. Another is data that doesn't change often. This is because we have control over the placement of rows within pages through the CLUSTER command.
Let's go back to our medley example. Its rows on disk are ordered by the n column ascending because that is the order in which we inserted them. What if we want to sort by the description column? The answer is to physically reorder the table by index the description column and clustering.
CREATE INDEX description_idx ON medley USING btree (description);
CLUSTER medley USING description_idx;
Now selecting all rows in the first page comes back alphabetized by description. If the table changes then new rows will be appended out of alphabetical order, but as long as the table doesn't change the returned items will fine. It can also be periodically re-clustered after changes although this operation locks the table and cannot be done when people need to access it.
Finally it's possible to determine the total number of pages for the table using its total byte size.
SELECT pg_relation_size('medley') / current_setting('block_size')::int;
When to Use: TID Scan
When fast deep random page access is required and filtering is not needed. Works especially well with append-only time-series data having low-variance row width.
As we saw, plain keyset pagination offers no facility to jump a certain percentage into the results except through client guesswork. However the PostgreSQL statistics collector maintains per-column histograms of value distribution. We can use these estimates in conjunction with limits and small offsets to get fast random-access pagination through a hybrid approach.
First let’s look at the statistics of our medley:
SELECT array_length(histogram_bounds, 1) - 1
FROM pg_stats
WHERE tablename = 'medley'
AND attname = 'n';
In my database the column n has 101 bound-markers, i.e. 100 ranges between bound-markers. The particular values aren’t too surprising because my data is uniformly distributed
{719,103188,193973,288794, … ,9690475,9791775,9905770,9999847}
Notice that the values are approximate. The first number is not exactly zero, and the last is not exactly ten million. The ranges divide our information into a block size B = 10,000,000 / 100 = 100,000 rows.
We can use the histogram ranges from the PostgreSQL stats collector to obtain probabilistically correct pages. If we choose a client-side page width of W how do we request the ith page? It will reside in block iW / B, at offset iW % B.
Choosing W=20 let’s request page 270,000 from the medley table. Note that PostgreSQL arrays are one-based so we have to adjust the values in the array lookups:
WITH bookmark AS (
SELECT (histogram_bounds::text::int[])[((270000 * 20) / 100000)+1] AS start,
(histogram_bounds::text::int[])[((270000 * 20) / 100000)+2] AS stop
FROM pg_stats
WHERE tablename = 'medley'
AND attname = 'n'
LIMIT 1
)
SELECT *
FROM medley
WHERE n >= (select start from bookmark)
AND n < (select stop from bookmark)
ORDER BY n ASC
LIMIT 20
OFFSET ((270000 * 20) % 100000);
This performs blazingly fast (notice the offset happens to be zero here). It gives back rows with n = 5407259 through 5407278. The true values on page 270000 are n = 5400001 through 5400020. The values is off by 7239, or about 0.1%.
We were lucky in our page choice there. For contrast, page 74999 requires an offset of 99980. We do know that our offset will be at most 100,000. The upper bound is within our control if we care to make a tradeoff. By adjusting the PostgreSQL stats collector we can get a more precise column histogram
ALTER TABLE medley ALTER COLUMN n SET statistics 1000;
VACUUM ANALYZE;
Now there are 1000 rather than 100 histogram buckets. On my database they have values
{10,10230,20863, …, 9980444,9989948,9999995}
With this bucket size our offset will be at most 10,000. The tradeoff is that the query planner now has to look through more values, slowing it down. So it’s a tradeoff of potential offset inefficiency vs query planner overhead.
This hybrid keyset/offset method probably doesn’t correspond to many real pagination use cases. It will not work with where clauses. It’s inaccurate and gets more so when the table changes and the stats collector hasn’t recently run.
When to Use: Keyset with Bookmarks When the client would like deep but approximate random access with no extra filtering allowed.
Like many engineering decisions, choosing pagination techniques involves tradeoffs. It’s safe to say that keyset pagination is most applicable for the average site with ordered linear access. However even limit-offset has its strengths, and more exotic techniques provide special performance characteristics for certain kinds of data. You can see there quite a few possibilities. Pick the right tool for the job and don’t let pagination be a closed book.