
Range-based, chronological pagination queries across multiple collections with MongoDB?

Is there an efficient way to do a range-based query across multiple collections, sorted by an index on timestamps? I basically need to pull in the latest 30 documents from 3 collections and the obvious way would be to query each of the collections for the latest 30 docs and then filter and merge the result. However that's somewhat inefficient.

Even if I were to select only the timestamp field in the first query and then run a second batch of queries for the latest 30 docs, I'm not sure that would be a better approach. That would still be 90 documents (whole or single field) per pagination request.

Essentially, the client can be subscribed to article categories, and each category of article differs by 0 to 2 fields. I picked 3 because that is the average number of categories that users are subscribed to so far in the beta. Because of the possible field differences, I didn't think it would be very consistent to put all of the articles of different types in a single collection.

paulkon asked Oct 03 '22 03:10



2 Answers

A MongoDB find query operates on one and only one collection at a time, so you need to structure your schema with collections that match your query needs.

Option A: Get Ids from supporting collection, load full docs, sort in memory

Either maintain a supporting collection that combines the ids, main collection names, and timestamps from the 3 collections. Query it to get your 30 ID/collection pairs, then load the corresponding full documents with up to 3 additional queries (1 to each main collection). Remember that those documents won't come back in the correct combined order, so you need to sort that page of results in memory before returning it to your client. A document in the supporting collection would look like this:

{
  _id: ObjectId,
  updated: Date,
  type: String
}

This way MongoDB does the pagination for you.
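To make the three steps of Option A concrete, here is a minimal sketch in plain Python. It is not driver code: the supporting collection is stood in for by a list, each main collection by a dict keyed on _id, and the names load_page, index_entries and collections are made up for illustration. Against a real database, step 1 would be a sorted, limited query on the supporting collection, and step 2 would be one lookup by id list per main collection.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def load_page(index_entries, collections, page_size=30):
    """Option A sketch: pick ids from the supporting collection, load, re-sort."""
    # Step 1: newest entries from the supporting collection.
    # (In MongoDB this would be a sorted, limited query, not an in-memory sort.)
    newest = sorted(index_entries, key=lambda e: e["updated"], reverse=True)[:page_size]

    # Step 2: group the ids by main collection, then one lookup per collection.
    ids_by_type = defaultdict(list)
    for entry in newest:
        ids_by_type[entry["type"]].append(entry["_id"])

    docs = []
    for type_name, ids in ids_by_type.items():
        # Stand-in for a single "find documents whose _id is in ids" query.
        docs.extend(collections[type_name][doc_id] for doc_id in ids)

    # Step 3: the per-collection results are not in combined order, so re-sort.
    docs.sort(key=lambda d: d["updated"], reverse=True)
    return docs

# Hypothetical sample data: 3 main collections with 20 articles each.
base = datetime(2013, 1, 1)
collections = {
    kind: {
        f"{kind}-{i}": {
            "_id": f"{kind}-{i}",
            "updated": base + timedelta(minutes=3 * i + offset),
            "title": f"{kind} article {i}",
        }
        for i in range(20)
    }
    for offset, kind in enumerate(["news", "sports", "tech"])
}

# The supporting collection mirrors _id, updated and type for every article.
index_entries = [
    {"_id": doc["_id"], "updated": doc["updated"], "type": kind}
    for kind, docs_by_id in collections.items()
    for doc in docs_by_id.values()
]

page = load_page(index_entries, collections)
```

The cost of this option is visible in the sample data: index_entries has to be kept in sync with every insert or update to the main collections.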

Option B: 3 Queries, Union, Sort, Limit

Or, as you said, load 30 documents from each collection, sort the union in memory, drop the extra 60, and return the combined result. This avoids the overhead of the extra collection and the work of keeping it in sync.
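The merge step of Option B is cheap if each of the 3 result sets already comes back sorted newest first, since they can then be merged lazily rather than fully re-sorted. A minimal sketch in plain Python, assuming each document has an updated timestamp as in the schema above; merge_latest and fake_page are hypothetical names, and the fake pages stand in for the results of the 3 real queries:

```python
import heapq
from datetime import datetime, timedelta
from itertools import islice

def merge_latest(pages, page_size=30):
    """Merge result sets that are each sorted newest first; keep the newest page_size."""
    # reverse=True tells heapq.merge the inputs are in descending order.
    merged = heapq.merge(*pages, key=lambda doc: doc["updated"], reverse=True)
    # islice stops after page_size documents, so the extra ones are never touched.
    return list(islice(merged, page_size))

start = datetime(2013, 1, 1)

def fake_page(kind, step_minutes):
    # Hypothetical stand-in for "latest 30 docs from one collection, newest first".
    return [
        {"_id": f"{kind}-{i}", "type": kind,
         "updated": start - timedelta(minutes=i * step_minutes)}
        for i in range(30)
    ]

pages = [fake_page("news", 1), fake_page("sports", 2), fake_page("tech", 3)]
page = merge_latest(pages)
```

For the next page, the updated value of the last document in page can serve as the upper bound for the 3 per-collection queries, which keeps the pagination range-based rather than skip-based.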

So I would think your current approach (Option B, as I call it) is the better of those 2 not-so-great options.

Peter Lyons answered Oct 05 '22 16:10



If your query is really to get the most recent articles based on a selection of categories, then I'd suggest you:

A) Store all of the documents in a single collection so that a single query can fetch a combined, paged result. If you keep separate collections instead, then unless the date ranges are very consistent across them, you'll need to track date ranges and counts so that you can reasonably fetch a set of documents that can be merged: the 30 from one collection may all be older than everything in another. With a single collection, you can add an index on timestamp and category and then limit the results.

B) Cache everything aggressively so that you rarely need to do the merges
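For suggestion A, the single-collection query could be sketched as below. The function only builds the filter, sort and limit as plain Python values, so it runs without a database. The field names category and updated, the function name page_query, and the use of the pymongo driver in the comments are all assumptions for illustration, not something the question specifies.

```python
from datetime import datetime

PAGE_SIZE = 30

def page_query(categories, before=None, page_size=PAGE_SIZE):
    """Build filter, sort and limit for one page of the newest articles."""
    # Only the categories this user is subscribed to.
    query = {"category": {"$in": list(categories)}}
    if before is not None:
        # Range-based paging: continue strictly below the last timestamp seen.
        query["updated"] = {"$lt": before}
    sort = [("updated", -1)]  # newest first
    return query, sort, page_size

subscribed = ["news", "sports", "tech"]  # hypothetical category names
first_page = page_query(subscribed)
next_page = page_query(subscribed, before=datetime(2013, 1, 1))

# With pymongo (assumed driver), and articles as the single collection:
#   articles.create_index([("category", 1), ("updated", -1)])
#   query, sort, limit = first_page
#   docs = list(articles.find(query).sort(sort).limit(limit))
```

Because the bound is a strict less-than on the timestamp alone, documents sharing the exact timestamp of the last one on a page could be skipped; adding _id as a tie-breaker is one way to handle that if it matters.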

WiredPrairie answered Oct 05 '22 18:10
