Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get a cursor for pagination in Graphql from a database?

I am having terrible problems getting a real cursor for resolving a database pagination result in GraphQL. No matter what kind of database (SQL e.g. mysql or NoSQL document e.g. mongodb) I am using, there is no way, I seem to be able to get a cursor or cursorlike object.

Propably I am missing out on some fundamental concepts but after searching my b... off I am beginning to seriously doubt whether the official GraphQL pagination documentation

https://graphql.org/learn/pagination/

is based on any real live experience at all.

Here's my question: How can I get anything even remotely resembling a cursor from a SQL query like this?

SELECT authors.id, authors.last_name, authors.created_at FROM authors
ORDER BY authors.last_name, author.created_at
LIMIT 10
OFFSET 20

I know, offset based pagination should not be used and instead cursor based navigation is considered a remedy. And I'd definitely like to cure my application from the offset disease. But in order to do that I need to be able to retrieve a cursor from somewhere.

I also understand (forgot where I read that) that primary keys should not be used for pagination either.

So, I am stuck here.

like image 684
LongHike Avatar asked Jul 11 '19 12:07

LongHike


People also ask

What is a cursor GraphQL?

This specification aims to provide an option for GraphQL clients to consistently handle pagination best practices with support for related metadata via a GraphQL server. This spec proposes calling this pattern “Connections” and exposing them in a standardized way.

What is cursor-based pagination?

Cursor-based pagination works by returning a pointer to a specific item in the dataset. On subsequent requests, the server returns results after the given pointer.

What is cursor offset?

An offset is simply the number of records the database should skip before selecting records. That means it's not just returning the next set of data you're requesting, it's scanning through all the previous data that comes before it.

What is the best approach to paginate Records in GraphQL?

This approach of paginating records using first and offset is ideal for static data. However, it is not recommended for the data that changes over time. The GraphQL specification recommends Cursor-based pagination. Before getting into the details of Cursor-based pagination, let's first try to understand the problems with the offset approach:

What is a pagination model in GraphQL?

Different pagination models enable different client capabilities. A common use case in GraphQL is traversing the relationship between sets of objects. There are a number of different ways that these relationships can be exposed in GraphQL, giving a varying set of capabilities to the client developer.

Why do we use cursors for pagination?

Especially if the cursors are opaque, either offset or ID-based pagination can be implemented using cursor-based pagination (by making the cursor the offset or the ID), and using cursors gives additional flexibility if the pagination model changes in the future.

Why do we use cursors in SQL?

This ensures that duplicate records are not fetched again, always reading data after a specific row rather than relying on the position of the record. Common examples of cursors include id (auto-incrementing integer/big integer), created_at timestamp which should be unique and sequential. Consider the following query with a where clause.


Video Answer


1 Answers

I think you're being down-voted for asking a good question. The first/last/before/after concept is difficult to implement in SQL.

I've been breaking my head over the same problem. The pagination documentation does not address how to define cursors when you are applying custom ORDER statements.

And I haven't really found a comprehensive solution online either. I found some posts where people are addressing the issue, but the answers are only partially correct or partially complete (just base64 encode the ID field to make a cursor seems to be the go-to answer, but that says little on what the query actually has to do to compute the cursor). Also any solutions involving row_number are quite ugly and not applicable across different SQL dialects. So let's try it differently.

Quick disclaimer, this is going to be a fairly comprehensive post, but if your back-end uses a decent query builder, you could technically program a method that works for implementing the first/last/before/after pagination required by Relay GraphQL onto ANY pre-existing query. The only requirement is that the tables you are sorting on all have a column that uniquely represents the default order of the records (usually if your primary key is an integer and is using auto-generated IDs, you can use that one, even though technically ordering a table by its primary key will not always yield the same result as returning the table unordered)

Forget about base64 for a moment and just assume the ID to be a valid cursor field that represents the default order of the table.

The answer you find online for using a cursor is usually this.

SELECT * FROM TABLE T
WHERE T.id > $cursorId;

Well, this works great to get all the entries after the cursor, AS LONG as you don't apply any other sorts to the query. Once you use a custom sort like in your example, this suggestion breaks down.

However the core logic in there can be re-applied for queries with sorts, but the solution needs to be broadened. Let's try to come up with the complete algorithm.


Algorithm for first n after c (first n nodes after cursor)

A node or edge is the same as a row in SQL terminology. (if 1 row represents a single entity, such as 1 author)

While the cursor is the row after which we will start returning sibling rows, be it forwards or backwards.

Given C is the cursor

A is any other row being compared to C.

T is the table of which both A and C are rows.

And v w x y z are 5 columns on table T, naturally both A and C have these columns.

The algorithm has to decide whether A is included or excluded from the return query based on the cursor object, given n, and the provided orders of these 5 columns.

Let's start with a single order.

Given there is 1 order (v): (which there should always be at the very least, if we assume our table to be ordered by its primary key by default) To show the first n records, we will need to apply a limit of n, that is trivial. The difficult part is after c.

For a table which is only being ordered by 1 field that would come down to:

 SELECT A FROM T
 WHERE A.v > C.v
 ORDER BY T.v ASC
 LIMIT n

This should show all rows which have a bigger v than C, and remove all rows who's v is smaller than that of C, meaning there will not be any rows left before C. If we assume the primary key correctly represents the natural order, we can drop the ORDER BY statement. Then a slightly more readable version of this query would become:

 SELECT A FROM T
 WHERE A.id > $cursorIdGivenByClient
 LIMIT n

And there, we've arrived at the simplest solution for providing a cursor to an 'unsorted' table. Which is the same solutation as the commonly accepted answer for dealing with cursors, but incomplete alas.

Now let's look at a query that is sorted by two columns (v and w):

 SELECT A FROM T
 WHERE A.v > C.v
 OR (A.v = C.v AND A.w > C.w)
 ORDER BY T.v ASC, T.w ASC
 LIMIT n

We start off with the same WHERE A.v > C.v, any row for which value v (A.v) is less than value of C for the first sort (C.v) is removed from the output result. However if the columns for the first order v have the same value for both A and C, A.v = C.v we need to look at the second order column to see if A is still allowed to be shown in the query result. Which will be the case if A.w > C.w

Let's move on to a query with 3 sorts:

 SELECT A FROM T
 WHERE A.v > C.v
 OR (A.v = C.v AND A.w > C.w)
 OR (A.v = C.v AND A.w = C.w AND A.x > C.x)
 ORDER BY T.v ASC, T.w ASC, T.x ASC
 LIMIT n

This is the same logic as for 2 sorts but a little bit more worked out. If the first column is the same, we need to look at the 2nd column to see who's the biggest one. If the second column is ALSO the same, we need to look at the 3rd column. It's important to realize that the primary key is always the last sort column in the ORDER BY statement, and the last condition to be compared against. In this case A.x > C.x (or A.id > $cursorId)

Anyway a pattern should start to arise. For sorting on 4 columns the query would be like this:

 SELECT A FROM T
 WHERE A.v > C.v
 OR (A.v = C.v AND A.w > C.w)
 OR (A.v = C.v AND A.w = C.w AND A.x > C.x)
 OR (A.v = C.v AND A.w = C.w AND A.x = C.x AND A.y > C.y)
 ORDER BY T.v ASC, T.w ASC, T.x ASC, T.y ASC
 LIMIT n

And finally for sorting on 5 columns.

 SELECT A FROM T
 WHERE A.v > C.v
 OR (A.v = C.v AND A.w > C.w)
 OR (A.v = C.v AND A.w = C.w AND A.x > C.x)
 OR (A.v = C.v AND A.w = C.w AND A.x = C.x AND A.y > C.y)
 OR (A.v = C.v AND A.w = C.w AND A.x = C.x AND A.y = C.y AND A.z > C.z)
 ORDER BY T.v ASC, T.w ASC, T.x ASC, T.y ASC, T.z ASC
 LIMIT n

That's a scary amount of comparisons. For every order added, the number of comparisons required to calculate first n after c grows by the Triangular Number performed on each row. Luckily we can apply some boolean algebra to condense and optimize this query.

 SELECT A FROM T
 WHERE (A.v > C.v OR
           (A.v = C.v AND 
              (A.w > C.w OR
                   (A.w = C.w AND
                       (A.x > C.x OR
                           (A.x = C.x AND
                               (A.y > C.y OR
                                    (A.y = C.y AND
                                        (A.z > C.z)))))))))
 ORDER BY T.v ASC, T.w ASC, T.x ASC, T.y ASC, T.z ASC
 LIMIT n

Even after condensing it, the pattern is quite clear. Every condition line alters between OR an AND, and every condition line alters between > and = , finally every 2 condition lines we compare the next order column.

And this comparison is surprisingly performant as well. On average half of all rows will qualify after the first A.v > C.v check and stop there. And of the other half that do get through, the majority will fail at the second A.v = C.v check and stop there. So while it may generate big queries, I wouldn't be too worried about performance.

But let's get concrete and use this to give you an answer on how to use a cursor for the example in question:

 SELECT authors.id, authors.last_name, authors.created_at FROM authors
 ORDER BY authors.last_name, author.created_at

Is your base query, sorted, but not yet paginated.

Your server receives a request to show "first 20 authors after author with cursor" After decoding the cursor, we find out that it represents the author with id 15.

First we can run a small precursor query to get the necessary information we will need:

 $authorLastName, $authorCreatedAt =
      SELECT authors.last_name, authors.created_at from author where id = 15;

Then we apply the algorithm and substitute the fields:

  SELECT a.id, a.last_name, a.created_at FROM authors a
  WHERE (a.last_name > $authorLastName OR
            (a.last_name = $authorLastName AND 
               (a.created_at > $authorCreatedAt OR
                    (a.created_at = $authorCreatedAt AND
                        (a.id > 15)))))
 ORDER BY a.last_name, a.created_at, a.id
 LIMIT 20;

There this query will correctly return the first 20 authors after the author with ID 15 according to the sorts of the query.

If you don't like using variables or secondary queries you can use subqueries as well:

  SELECT a.id, a.last_name, a.created_at FROM authors a
  WHERE (a.last_name > (select last_name from authors where id 15) OR
            (a.last_name = (select last_name from authors where id 15) AND 
               (a.created_at > (select created_at from authors where id 15)  OR
                    (a.created_at = (select created_at from authors where id 15) AND
                        (a.id > 15)))))
 ORDER BY a.last_name, a.created_at, a.id
 LIMIT 20;

Again this isn't as bad as it seems, the subqueries are not correlated and the results will be cached over row loops, so it won't be particularly bad for performance. But the query does become messy, especially when you start using JOINS which will need to be applied in the subqueries as well.

You wouldn't need to explicitly call the ORDER on a.id, but I do it to be consistent with the algorithm. It does become very important if you're using DESC instead of ASC.

So what happens if you use DESC columns instead of ASC? Does the algorithm break? Well not if you apply a small extra rule. For whichever column is using DESC instead of ASC, you replace the '>' sign with '<' and the algorithm will now work for sorting in both directions.

JOINS have no impact on this algorithm (thank god), other than the fact that 20 rows from joined tables won't necessarily represent 20 entities (20 authors in this case), but that's a problem that is independent of the whole first/after matter and which you will also have using OFFSET.

It's also not particularly difficult to handle queries which already have pre-existing WHERE conditions. You just take all the pre-existing conditions, wrap them between brackets, and combine them with an AND statement to the conditions generated by the algorithm.

There, we've implemented an algorithm that can handle any input query and properly paginate it using first/after. (If there are edge cases I missed, do let me know)

And you could stop there but... unfortunately

You still need to handle first n, last n, before c, after c, last n before c, last n after c and first n before c if you want to be compliant with the GraphQL Relay specs and get rid of offset completely :).

You can get halfway using the given AFTER-algorithm I just provided. But for the other half you will need to use the BEFORE-algorithm. It's very similar to the AFTER algorithm:

 SELECT A FROM T
 WHERE (A.v < C.v OR
           (A.v = C.v AND 
              (A.w < C.w OR
                   (A.w = C.w AND
                       (A.x < C.x OR
                           (A.x = C.x AND
                               (A.y < C.y OR
                                    (A.y = C.y AND
                                        (A.z < C.z)))))))))
 ORDER BY T.v ASC, T.w ASC, T.x ASC, T.y ASC, T.z ASC
 LIMIT n

To get the BEFORE-algorithm, you take the AFTER-algorithm and just switch all '<' operators to '>' operators and vice versa. (So in essence before and after are the same algorithm with BEFORE/AFTER + ASC/DESC deciding which direction the operator will have to point to.)

For 'first n' you don't need to do anything except apply 'LIMIT n' to the query.

For 'last n' you need to apply 'LIMIT n' and reverse all given ORDERS , switching ASC with DESC and DESC with ASC. There is one caveat with the 'last n' , while it will correctly return the last n records, it will do so in reversed order, so you need to manually reverse the returned set again, be it in your database or inside your code.

There with those rules you can successfully integrate any pagination requests from the Relay GraphQL spec onto any SQL query, using a unique sortable column, often the primary key, as cursor that represents the source of truth for default sorting of the table.

It's quite daunting but I managed to write a plugin for Doctrine DQL builder using those algorithms to implement first/last/before/after pagination methods using a MySQL database. So it's definitely doable.

like image 134
Hydde87 Avatar answered Oct 05 '22 21:10

Hydde87