MongoDB has an extremely straightforward way to implement pagination: using skip and limit operations on a cursor. skip(n) skips n items in a query, while limit(m) returns only the next m items starting from the n-th one.
The Limit() Method To limit the records in MongoDB, you need to use limit() method. The method accepts one number type argument, which is the number of documents that you want to be displayed.
Keyset pagination (also known as the "seek method") is used to fetch a subset of records from a table quickly. It does this by restricting the set of records returned with a combination of WHERE and LIMIT clauses.
It is perfectly fine to use ObjectId() though your syntax for pagination is wrong. You want:
db.tweets.find().limit(50).sort({"_id":-1});
This says you want tweets sorted by _id
value in descending order and you want the most recent 50. Your problem is the fact that pagination is tricky when the current result set is changing - so rather than using skip for the next page, you want to make note of the smallest _id
in the result set (the 50th most recent _id
value and then get the next page with:
db.tweets.find( {_id : { "$lt" : <50th _id> } } ).limit(50).sort({"_id":-1});
This will give you the next "most recent" tweets, without new incoming tweets messing up your pagination back through time.
There is absolutely no need to worry about whether _id
value is strictly corresponding to insertion order - it will be 99.999% close enough, and no one actually cares on the sub-second level which tweet came first - you might even notice Twitter frequently displays tweets out of order, it's just not that critical.
If it is critical, then you would have to use the same technique but with "tweet date" where that date would have to be a timestamp, rather than just a date.
Wouldn't a tweet "actual" timestamp (i.e. time tweeted and the criteria you want it sorted by) be different from a tweet "insertion" timestamp (i.e. time added to local collection). This depends on your application, of course, but it's a likely scenario that tweet inserts could be batched or otherwise end up being inserted in the "wrong" order. So, unless you work at Twitter (and have access to collections inserted in correct order), you wouldn't be able to rely just on $natural
or ObjectID
for sorting logic.
Mongo docs suggest skip
and limit
for paging:
db.tweets.find({created: {$lt: maxID}).
sort({created: -1, username: 1}).
skip(50).limit(50); //second page
There is, however, a performance concern when using skip:
The
cursor.skip()
method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return result. As offset increases,cursor.skip()
will become slower and more CPU intensive.
This happens because skip
does not fit into the MapReduce model and is not an operation that would scale well, you have to wait for a sorted collection to become available before it can be "sliced". Now limit(n)
sounds like an equally poor method as it applies a similar constraint "from the other end"; however with sorting applied, the engine is able to somewhat optimize the process by only keeping in memory n
elements per shard as it traverses the collection.
An alternative is to use range based paging. After retrieving the first page of tweets, you know what the created
value is for the last tweet, so all you have to do is substitute the original maxID
with this new value:
db.tweets.find({created: {$lt: lastTweetOnCurrentPageCreated}).
sort({created: -1, username: 1}).
limit(50); //next page
Performing a find
condition like this can be easily parallellized. But how to deal with pages other than the next one? You don't know the begin date for pages number 5, 10, 20, or even the previous page! @SergioTulentsev suggests creative chaining of methods but I would advocate pre-calculating first-last ranges of the aggregate field in a separate pages
collection; these could be re-calculated on update. Furthermore, if you're not happy with DateTime
(note the performance remarks) or are concerned about duplicate values, you should consider compound indexes on timestamp + account tie (since a user can't tweet twice at the same time), or even an artificial aggregate of the two:
db.pages.
find({pagenum: 3})
> {pagenum:3; begin:"01-01-2014@BillGates"; end:"03-01-2014@big_ben_clock"}
db.tweets.
find({_sortdate: {$lt: "03-01-2014@big_ben_clock", $gt: "01-01-2014@BillGates"}).
sort({_sortdate: -1}).
limit(50) //third page
Using an aggregate field for sorting will work "on the fold" (although perhaps there are more kosher ways to deal with the condition). This could be set up as a unique index with values corrected at insert time, with a single tweet document looking like
{
_id: ...,
created: ..., //to be used in markup
user: ..., //also to be used in markup
_sortdate: "01-01-2014@BillGates" //sorting only, use date AND time
}
The following approach wil work even if there are multiple documents inserted/updated at same millisecond even if from multiple clients (which generates ObjectId). For simiplicity, In following queries I am projecting _id, lastModifiedDate.
First page, fetch the result Sorted by modifiedTime (Descending), ObjectId (Ascending) for fist page.
db.product.find({},{"_id":1,"lastModifiedDate":1}).sort({"lastModifiedDate":-1, "_id":1}).limit(2)
Note down the ObjectId and lastModifiedDate of the last record fetched in this page. (loid, lmd)
db.productfind({$or:[{"lastModifiedDate":{$lt:lmd}},{"_id":1,"lastModifiedDate":1},{$and:[{"lastModifiedDate":lmd},{"_id":{$gt:loid}}]}]},{"_id":1,"lastModifiedDate":1}).sort({"lastModifiedDate":-1, "_id":1}).limit(2)
repeat same for subsequent pages.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With