
MongoDB - simulate join or subquery

Tags:

mongodb

I'm trying to figure out the best way to structure my data in Mongo to simulate what would be a simple join or subquery in SQL.

Say I have the classic Users and Posts example, with Users in one collection and Posts in another. I want to find all posts by users whose city is "london".

I've simplified things in this question. In my real-world scenario, storing Posts as an array in the User document won't work, as I have thousands of "posts" per user, with new ones constantly being inserted.

Can Mongo's $in operator help here? Can $in handle an array of 10,000,000 entries?

Kong asked Jul 07 '10 21:07


2 Answers

Honestly, if you can't fit "Posts" into "Users", then you have two options.

  1. Denormalize some User data inside of posts. Then you can search through just the one collection.
  2. Do two queries (one to find the users, the other to find their posts).

Based on your question, you're trying to do #2.

Theoretically, you could build a list of User IDs (or refs) and then find all Posts belonging to a User $in that array. But obviously that approach is limited.
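As a rough illustration of the two-query approach, here is a minimal Python sketch. It assumes `users` and `posts` are pymongo-style collection objects, and the field names `city` and `user_id` are hypothetical; the original question does not specify a schema. The user IDs are fed to $in in batches so that no single query document grows too large.

```python
from itertools import islice


def posts_by_city(users, posts, city, batch_size=1000):
    """Yield posts whose author lives in `city`, one $in query per batch.

    `users` and `posts` are assumed to expose a pymongo-style
    find(filter, projection) method. Field names are hypothetical.
    """
    # Query 1: stream only the _id of each matching user.
    user_ids = (u["_id"] for u in users.find({"city": city}, {"_id": 1}))
    while True:
        batch = list(islice(user_ids, batch_size))
        if not batch:
            break
        # Query 2: fetch the posts for this batch of users with $in.
        yield from posts.find({"user_id": {"$in": batch}})
```

With a real database this might be called as `posts_by_city(db.users, db.posts, "london")`. Batching keeps each query small, but it does not change the overall amount of data returned.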

Can $in handle an array of 10,000,000 entries?

Look, if you're planning to "query" your posts for all users in a set of 10,000,000 Users, you are well past the stage of a "query". You say yourself that each User has thousands of posts, so a query for "Posts by Users who live in London" would return hundreds of millions of records.

100M records isn't a query, that's a dataset!

If you're worried about breaking the $in operator, then I strongly suggest that you use map/reduce. Mongo's map/reduce will create a new collection for you. You can then trim down or summarize this dataset as you see fit.

Gates VP answered Oct 25 '22 08:10


$in can handle 100,000 entries. I've never tried 10,000,000 entries, but the query (a query is also a document) has to be smaller than 4 MB (like every document), so 10,000,000 entries isn't possible.
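To make the size argument concrete, here is a back-of-the-envelope sketch in Python. It assumes the IDs are 12-byte ObjectIds (the question does not say what the IDs are) and counts only the BSON array itself, ignoring the rest of the query document. In BSON, each array element costs one type byte, the element's index as a null-terminated string key, and the value.

```python
def bson_objectid_array_size(n):
    """Approximate BSON size in bytes of an array holding n ObjectIds."""
    size = 4 + 1  # int32 length prefix + trailing null byte of the array
    digits, start = 1, 0
    while start < n:
        # Indexes in [start, end) are written with `digits` characters.
        end = min(n, 10 ** digits)
        count = end - start
        # Per element: type byte + index key + key terminator + 12-byte ObjectId.
        size += count * (1 + digits + 1 + 12)
        start, digits = end, digits + 1
    return size


LIMIT = 4 * 1024 * 1024  # the 4 MB document limit cited above

for n in (100_000, 10_000_000):
    size = bson_objectid_array_size(n)
    print(n, size, "fits" if size < LIMIT else "too large")
```

Under these assumptions, 100,000 IDs come to roughly 1.9 MB and fit, while 10,000,000 IDs come to roughly 200 MB and do not.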

Why don't you include the user and their town in each document of the Posts collection? You can index this town, because you can index properties of embedded entities. You no longer have to simulate a join, because you can query the Posts by the towns of their embedded users.

This means that you have to update the Posts when the town of a user changes, but that doesn't happen very often. This update will be fast if you index the UserId in the Posts collection.
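The denormalized layout could look something like the following Python sketch. It assumes pymongo-style collection objects are passed in, and the embedded field names `user.id` and `user.town` are hypothetical, chosen only for illustration.

```python
# Hypothetical shape of a post with the author's town embedded:
# {"_id": ..., "title": "...", "user": {"id": 42, "town": "london"}}


def ensure_indexes(posts):
    """Index the embedded fields used for searching and for updates."""
    posts.create_index("user.town")  # supports the "posts by town" query
    posts.create_index("user.id")    # keeps the town update below fast


def posts_in_town(posts, town):
    """Single query on one collection, with no join to simulate."""
    return posts.find({"user.town": town})


def move_user(users, posts, user_id, new_town):
    """Change a user's town and propagate it to all of their posts."""
    users.update_one({"_id": user_id}, {"$set": {"town": new_town}})
    posts.update_many({"user.id": user_id}, {"$set": {"user.town": new_town}})
```

The trade-off is the one described above: reads become a single indexed query, while a change of town has to rewrite every post by that user.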

TTT answered Oct 25 '22 07:10