Best way to fetch a continuous list with PostgreSQL in a web app

I am making an API over HTTP that fetches many rows from PostgreSQL with pagination. In ordinary cases, I usually implement such pagination through a naive OFFSET/LIMIT clause. However, there are some special requirements in this case:

  • There are so many rows that I believe users will never reach the end (imagine a Twitter timeline).
  • Pages do not have to be randomly accessible, only sequentially.
  • The API would return a URL containing a cursor token that points to the next page of continuous chunks.
  • Cursor tokens do not have to exist permanently, only for some time.
  • The ordering fluctuates frequently (like Reddit rankings); however, continued cursors should keep their ordering consistent.

How can I achieve this? I am ready to change my whole database schema for it!
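For reference, the naive pagination I mean is just something like this (the table and column names here are only for illustration):

    select *
    from entries
    order by score desc
    limit 50     -- page_size
    offset 100;  -- (page - 1) * page_size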

asked Oct 12 '11 by minhee


2 Answers

Assuming it's only the ordering of the results that fluctuates and not the data in the rows, Fredrik's answer makes sense. However, I'd suggest the following additions:

  • Store the id list in a PostgreSQL table using the array type rather than in memory. Doing it in memory, unless you carefully use something like Redis with auto expiry and memory limits, is setting yourself up for a DoS memory-consumption attack. I imagine it would look something like this:

    create table foo_paging_cursor (
      cursor_token uuid primary key, -- probably a uuid is best, or a timestamp (see below)
      result_ids integer[],          -- or text[] if you have non-integer ids
      expiry_time TIMESTAMP
    );
    
  • You need to decide whether the cursor_token and result_ids can be shared between users to reduce your storage needs and the time needed to run the initial query per user. If they can be shared, choose a cache window, say one or five minutes, and then upon a new request create the cursor_token for that time period and check whether the result ids have already been calculated for that token. If not, add a new row for that token. You should probably add a lock around the check/insert code to handle concurrent requests for a new token (see the sketch after this list).

  • Have a scheduled background job that purges old tokens/results, and make sure your client code can handle any errors related to expired/invalid tokens.
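A minimal sketch of the shared-token path, assuming the foo_paging_cursor table above, a five-minute cache window, and a hypothetical items table ordered by a score column; here ON CONFLICT (PostgreSQL 9.5+) stands in for the explicit lock around the check/insert step:

    -- Derive a deterministic token for the current 5-minute window and store the
    -- result ids only if nobody else has done so yet for this window.
    insert into foo_paging_cursor (cursor_token, result_ids, expiry_time)
    select
      md5(floor(extract(epoch from now()) / 300)::text)::uuid,  -- 5-minute bucket
      array(select id from items order by score desc limit 500),
      now() + interval '15 minutes'
    on conflict (cursor_token) do nothing;

    -- Scheduled background job: purge expired tokens/results.
    delete from foo_paging_cursor where expiry_time < now();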

Don't even consider using real db cursors for this.

Keeping the result ids in Redis lists is another way to handle this (see the LRANGE command), but be careful with expiry and memory usage if you go down that path. Your Redis key would be the cursor_token and the ids would be the members of the list.

answered Oct 19 '22 by Tavis Rudd

I know absolutely nothing about PostgreSQL, but I'm a pretty decent SQL Server developer, so I'd like to take a shot at this anyway :)

How many rows/pages do you expect a user would browse through per session, at most? For instance, if you expect a user to page through a maximum of 10 pages per session [each page containing 50 rows], you could take that max and set up the webservice so that when the user requests the first page, you cache 10*50 rows (or just the ids for the rows, depending on how much memory and how many simultaneous users you have).

This would certainly help speed up your webservice, in more ways than one. And it's quite easy to implement, too. So:

  • When a user requests data from page #1: run a query (complete with order by, joins, etc.), store all the ids in an array (but a maximum of 500 ids). Return the data rows that correspond to the ids at positions 0-49 in the array.
  • When the user requests pages #2-10: return the data rows that correspond to the ids at positions (page-1)*50 to page*50-1 in the array (see the sketch after this list).
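A minimal sketch of that second step, assuming $1 holds the 50 cached ids already sliced out for the requested page and items is a hypothetical table; the join preserves the cached ordering:

    -- Return the rows for one page in the same order as the cached id slice.
    select i.*
    from unnest($1::integer[]) with ordinality as page(id, pos)
    join items i on i.id = page.id
    order by page.pos;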

You could also bump up the numbers; an array of 500 ints would only occupy about 2 KB of memory, but it also depends on how fast you want your initial query/response to be.

I've used a similar technique on a live website, and when the user continued past page 10, I just switched to queries. I guess another solution would be to continue to expand/fill the array (running the query again, but excluding already-included ids).
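If you go the expand-the-array route, the refill query could look roughly like this (again assuming a hypothetical items table ordered by score, with $1 holding the ids already cached):

    -- Fetch the next batch of ids, skipping those already in the cached array.
    select id
    from items
    where id <> all ($1::integer[])
    order by score desc
    limit 500;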

Anyway, hope this helps!

answered Oct 19 '22 by Fredrik Johansson