Pagination Strategies for Complex (slow) Datasets

What are some of the strategies being used for pagination of data sets that involve complex queries? COUNT(*) takes ~1.5 seconds, so we don't want to hit the DB on every page view. Currently there are ~45k rows returned by this query.

Here are some of the approaches I've considered:

  • Cache the row count and update it every X minutes (a rough sketch of this appears after the list)
  • Limit (and offset) the rows counted to 41 (for example) and display the page picker as "1 2 3 4 ..."; then recompute if anyone actually goes to page 4 and display "... 3 4 5 6 7 ..."
  • Get the row count once and store it in the user's session
  • Get rid of the page picker and just have a "Next Page" link
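
For the first option, here is a rough sketch of what the cached count could look like in PHP (assuming a Memcached server and a PDO connection; the key name, TTL, and query are illustrative, not taken from the actual application):

<?php
// Sketch of option 1: cache the expensive COUNT(*) and refresh it every few minutes.
// Assumes a local Memcached server and an existing PDO connection; names are illustrative.
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

function getRowCount(PDO $pdo, Memcached $cache): int
{
    $count = $cache->get('search_row_count');
    if ($count === false) {
        // Cache miss: run the slow count once, then keep the result for 5 minutes.
        $count = (int) $pdo->query('SELECT COUNT(*) FROM mytable /* ...complex WHERE... */')->fetchColumn();
        $cache->set('search_row_count', $count, 300);
    }
    return (int) $count;
}
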
Jerry Fernholz asked Nov 09 '09


1 Answer

I've had to engineer a few pagination strategies using PHP and MySQL for a site that does over a million page views a day. I pursued the strategy in stages:

Multi-column indexes. I should have done this first, before attempting a materialized view.
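
For illustration only (the real tables and columns aren't shown here), the idea is a composite index that matches both the filter and the sort of the paginated query, so LIMIT/OFFSET can walk the index instead of filesorting. Table, column, and index names below are hypothetical:

<?php
// Illustrative sketch: composite index matching the pagination query's WHERE and ORDER BY.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

$pdo->exec('ALTER TABLE mytable ADD INDEX idx_category_published (category_id, published_at)');

// The paginated query filters and orders on the same columns, in the same order:
$stmt = $pdo->prepare(
    'SELECT id, title, published_at
       FROM mytable
      WHERE category_id = :cat
   ORDER BY published_at DESC
      LIMIT :offset, :per_page'
);
$stmt->bindValue(':cat', 42, PDO::PARAM_INT);
$stmt->bindValue(':offset', 0, PDO::PARAM_INT);
$stmt->bindValue(':per_page', 20, PDO::PARAM_INT);
$stmt->execute();
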

Generating a materialized view. I created a cron job that did a common denormalization of the document tables I was using. I would SELECT ... INTO OUTFILE ... and then create the new table, and rotate it in:

-- Dump the current state of the source table to a flat file...
SELECT ... INTO OUTFILE '/tmp/ondeck.txt' FROM mytable ...;
-- ...build a fresh copy with the same structure and load the dump into it...
CREATE TABLE ondeck_mytable LIKE mytable;
LOAD DATA INFILE '/tmp/ondeck.txt' INTO TABLE ondeck_mytable...;
-- ...then swap the new copy in atomically, keeping the previous one around as "dugout".
DROP TABLE IF EXISTS dugout_mytable;
RENAME TABLE atbat_mytable TO dugout_mytable, ondeck_mytable TO atbat_mytable;

This kept the lock time on the write-contended mytable down to a minimum, and the pagination queries could hammer away on the atbat materialized view. I've simplified the above, leaving out the actual manipulation, which is unimportant.

Memcache. I then created a wrapper around my database connection to cache these paginated results into memcache. This was a huge performance win. However, it was still not good enough.
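
A minimal sketch of that kind of wrapper, assuming Memcached and PDO (the key scheme, TTL, and query are illustrative, not the original code):

<?php
// Minimal sketch of a caching wrapper around the paginated query.
function fetchPage(PDO $pdo, Memcached $cache, int $page, int $perPage = 20): array
{
    $key  = "atbat_mytable:page:$page:$perPage";
    $rows = $cache->get($key);
    if ($rows === false) {
        $stmt = $pdo->prepare('SELECT * FROM atbat_mytable ORDER BY id LIMIT :offset, :limit');
        $stmt->bindValue(':offset', ($page - 1) * $perPage, PDO::PARAM_INT);
        $stmt->bindValue(':limit', $perPage, PDO::PARAM_INT);
        $stmt->execute();
        $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
        // Results may go slightly stale; the materialized view is rebuilt by cron anyway.
        $cache->set($key, $rows, 600);
    }
    return $rows;
}
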

Batch generation. I wrote a PHP daemon and extracted the pagination logic into it. It would detect changes to mytable and periodically regenerate all the pages, from the oldest changed record to the most recent, writing them out to the webserver's filesystem. With a bit of mod_rewrite, I could check whether the page existed on disk and serve it up. This also allowed me to take effective advantage of reverse proxying by letting Apache detect If-Modified-Since headers and respond with 304 response codes. (Obviously, I removed any option of allowing users to select the number of results per page, an unimportant feature.)
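
A stripped-down sketch of the regeneration step (it rewrites every page rather than only the changed range, and the output path, page size, column names, and markup are made up):

<?php
// Stripped-down sketch of batch generation: render each page of the materialized
// view to a static HTML file that Apache can serve (and answer If-Modified-Since for).
const PAGE_DIR = '/var/www/static/pages';
const PER_PAGE = 20;

function regenerateAllPages(PDO $pdo): void
{
    $total = (int) $pdo->query('SELECT COUNT(*) FROM atbat_mytable')->fetchColumn();
    $pages = max(1, (int) ceil($total / PER_PAGE));

    for ($page = 1; $page <= $pages; $page++) {
        $stmt = $pdo->prepare('SELECT title FROM atbat_mytable ORDER BY id LIMIT :offset, :limit');
        $stmt->bindValue(':offset', ($page - 1) * PER_PAGE, PDO::PARAM_INT);
        $stmt->bindValue(':limit', PER_PAGE, PDO::PARAM_INT);
        $stmt->execute();

        $html = '<ul>';
        foreach ($stmt->fetchAll(PDO::FETCH_COLUMN) as $title) {
            $html .= '<li>' . htmlspecialchars($title) . '</li>';
        }
        $html .= '</ul>';

        file_put_contents(PAGE_DIR . "/page-$page.html", $html);
    }
}
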

Updated: RE count(*): When using MyISAM tables, COUNT didn't create a problem once I was able to reduce the amount of read-write contention on the table. If I were doing InnoDB, I would create a trigger that updated an adjacent table with the row count. That trigger would just +1 or -1 depending on INSERT or DELETE statements.
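
A sketch of that trigger-maintained counter for InnoDB (the counter table and trigger names are illustrative):

<?php
// Sketch of the InnoDB counter-table idea: keep a one-row table with the current
// row count and let triggers maintain it.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

$pdo->exec('CREATE TABLE IF NOT EXISTS mytable_count (n BIGINT NOT NULL) ENGINE=InnoDB');
$pdo->exec('INSERT INTO mytable_count (n) SELECT COUNT(*) FROM mytable');

$pdo->exec('CREATE TRIGGER mytable_count_ins AFTER INSERT ON mytable
            FOR EACH ROW UPDATE mytable_count SET n = n + 1');
$pdo->exec('CREATE TRIGGER mytable_count_del AFTER DELETE ON mytable
            FOR EACH ROW UPDATE mytable_count SET n = n - 1');

// The pagination code then reads the cheap counter instead of running COUNT(*):
$total = (int) $pdo->query('SELECT n FROM mytable_count')->fetchColumn();
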

RE page-pickers (thumbwheels): When I moved to aggressive query caching, thumbwheel queries were also cached, and when it came to batch generating the pages, I was using temporary tables, so computing the thumbwheel was no problem. A lot of the thumbwheel calculation was simplified because it became a predictable filesystem pattern that actually only needed the largest page number. The smallest page number was always 1.

Windowed thumbwheel. The example you give above for a windowed thumbwheel (<< 4 [5] 6 >>) should be pretty easy to do without any queries at all, so long as you know your maximum number of pages.
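
For example, a small sketch of that arithmetic (the window width and function name are illustrative):

<?php
// Sketch of a windowed page picker: given the current page and the total page
// count (both already known), compute the visible window with no extra queries.
function pageWindow(int $current, int $totalPages, int $width = 3): array
{
    $half  = intdiv($width, 2);
    $start = max(1, min($current - $half, $totalPages - $width + 1));
    $end   = min($totalPages, $start + $width - 1);
    return range($start, $end);   // e.g. page 5 of 2000 with width 3 -> [4, 5, 6], i.e. << 4 [5] 6 >>
}
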

memnoch_proxy answered Sep 19 '22