Smart pagination algorithm that works with local data cache

This is a problem I have been thinking about for a long time, but I haven't written any code yet, because I first want to solve some general problems I am struggling with. This is the main one.

Background

A single page web application makes requests for data to some remote API (which is under our control). It then stores this data in a local cache and serves pages from there. Ideally, the app remains fully functional when offline, including the ability to create new objects.

Constraints

  • Assume a server-side database of roughly 50,000 products (about 50 MB)
  • Assume no specific DB type; we interact with it via a REST/GraphQL interface
  • Assume a single product record is < 1 kB
  • Assume a max payload of 256 kB per result set
  • Assume max 5 MB of storage on the client
  • Assume search result sets ranging between 0 and 5000 items per search

Challenge

The challenge is to define a stateless but (network-)efficient way to fetch pages from a result set, such that it is deterministic which results we will get.

Example

In traditional paging, the next 100 results for some query are fetched using a URL like this:

https://example.com/products?category=shoes&firstResult=100&pageSize=100

the search result may look like this:

{
  "totalResults": 2458,
  "firstResult": 100,
  "pageSize": 100,
  "results": [
    {"some": "item"},
    {"some": "other item"},
    // 98 more ...
  ]
}

The problem with this is that there is no way, based on this information, to get exactly the objects that are on a certain page: by the time we request the next page, the result set may have changed (due to changes in the DB), influencing which items are part of it. Even a small change can have a big impact: one item removed from the DB that happened to be on page 0 of the result set will change what we get back when requesting every subsequent page.
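
For context, traditional paging like the URL above typically maps to an OFFSET/LIMIT query; here is a sketch (the table and column names are assumptions):

-- Page 1 of the "shoes" result set, 100 items per page.
-- If a row from page 0 is deleted between two requests, every
-- later row shifts up by one, and this OFFSET now silently skips
-- an item that was never returned on any page.
SELECT *
FROM products
WHERE category = 'shoes'
ORDER BY name
LIMIT 100 OFFSET 100;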

Goal

I am looking for a mechanism that makes the definition of the result set independent of future database changes, so that if someone searched for shoes and got a result set of 2458 items, they could reliably fetch all pages of that result set even if later changes in the DB would otherwise have influenced it. (I plan to not really delete items, but to set a removed flag on them, for this purpose.)

Ideas so far

I have seen a solution where the result set included a "pages" property: an array with the first and last ID of the items on each page. Assuming your IDs are monotonically increasing and you never really delete items from the DB, the number of items between two IDs is constant, so the app could fetch all items between those two IDs and always get exactly the same items back (see the sketch below). The problem with this solution is that it only works if the list is sorted in ID order... I need custom sorting options.
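
A minimal sketch of that ID-boundary idea, assuming an auto-incrementing id, no hard deletes, and example boundary values taken from the "pages" array (all names and values here are illustrative):

-- Fetch the page whose boundaries were recorded as firstId..lastId.
-- Because IDs only grow and rows are never hard-deleted, the rows
-- between the two boundaries stay the same on every request.
SELECT *
FROM products
WHERE category = 'shoes'
  AND id BETWEEN 137 AND 512   -- firstId / lastId from the pages array
ORDER BY id;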

The only way I have come up with so far is to just send a list of all IDs in the result set... That way pages can be fetched by doing a SELECT * FROM products WHERE id IN (3, 4, 6, 9, ...), but this feels rather inelegant...

Anyway, I hope this is not too broad or theoretical. I have a web-based DB, just no good idea of how to do paging with it. I am looking for answers that point me in a direction to learn, not full solutions.

asked Jan 31 '17 by Stijn de Witt

2 Answers

A versioned DB is the answer for result-set consistency. Each record gets a primary id, a modification counter (version number), and a timestamp of modification/creation. Instead of modifying record r, you add a new record with the same id, version number + 1, and sysdate as the modification time.

In the fetch response you add the DB request_time (do not use a client timestamp, due to possible clock differences between client and server). The first page is served normally, and you return sysdate as the request_time. All other pages are served differently: you add a condition like modification_time <= request_time for each versioned table, as in the sketch below.
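
A minimal sketch of such a page query, assuming a product_versions table with id, version, and modified_at columns (these names, the category filter, and the :request_time bind parameter are illustrative assumptions, not part of the answer):

-- For each product id, pick the newest version that already existed
-- at request_time; versions written after request_time are ignored,
-- so the snapshot (and therefore every page of it) never shifts.
SELECT p.*
FROM product_versions p
JOIN (
    SELECT id, MAX(version) AS version
    FROM product_versions
    WHERE modified_at <= :request_time
    GROUP BY id
) latest ON latest.id = p.id AND latest.version = p.version
WHERE p.category = 'shoes'
ORDER BY p.name
LIMIT 100 OFFSET 100;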

answered by Alexander Anikin


You can cache the result set of IDs on the server side when a query arrives for the first time, and return a unique ID to the frontend. This unique ID corresponds to the result set for that query, so the frontend can then request something like next_page along with the unique ID it got when it first made the query.

You should still go ahead with your approach of turning the DELETE operation into a removed flag, because that ensures none of the entries in a cached result set is actually deleted. You can discard a query's result set from the cache when the frontend reaches the end of the result set, or you can set a time limit on the lifetime of the cache entry. A sketch of the idea follows below.
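
One way to picture that cache in SQL terms (a sketch only; the result_cache table, the query_id value, and the removed column are assumptions, and a key/value store would serve equally well):

-- When a query arrives for the first time: freeze the ordered list
-- of matching IDs under a freshly generated unique query_id.
INSERT INTO result_cache (query_id, position, product_id)
SELECT 'q-7f3a', ROW_NUMBER() OVER (ORDER BY name), id
FROM products
WHERE category = 'shoes' AND removed = FALSE;

-- Any later page request joins the frozen ID list back to the
-- products table; the contents of each page can no longer shift.
SELECT p.*
FROM result_cache c
JOIN products p ON p.id = c.product_id
WHERE c.query_id = 'q-7f3a'
  AND c.position BETWEEN 101 AND 200
ORDER BY c.position;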

answered by deLta