
MongoDB Array Query Performance

I'm trying to figure out the best schema for a dating-site-like app. Users have a listing (possibly many), and they can view other users' listings to 'like' or 'dislike' them.

Currently I'm just storing the other person's listing id in a likedBy and dislikedBy array. When a user 'likes' a listing, their listing id is added to that listing's likedBy array. However, I would now like to track the timestamp at which a user likes a listing. This would be used for a user's 'history list' or for data analysis.

I would need to do two separate queries:

find all active listings that this user has not liked or disliked before

and for a user's history of 'liked'/'disliked' choices

find all the listings user X has liked in chronological order

My current schema is:

listings
  _id: 'sdf3f'
  likedBy: ['12ac', 'as3vd', 'sadf3']
  dislikedBy: ['asdf', 'sdsdf', 'asdfas']
  active: bool

Could I do something like this?

listings
  _id: 'sdf3f'
  likedBy: [{'12ac', date: Date}, {'ds3d', date: Date}]
  dislikedBy: [{'s12ac', date: Date}, {'6fs3d', date: Date}]
  active: bool

I was also thinking of making a new collection for choices.

choices
  Id
  userId          // id of current user making the choice
  userlistId      // listing of the user making the choice
  listingChoseId  // the listing they chose yes/no
  type
  date

I'm not sure of the performance implications of keeping these choices in another collection when running the "find all active listings that this user has not liked or disliked before" query.

Any insight would be greatly appreciated!

asked Apr 08 '14 20:04 by SkinnyGeek1010


1 Answer

Well, you evidently thought it was a good idea to embed these in the "listings" documents so that your other usage patterns, beyond the cases presented here, work properly. With that in mind, there is no reason to throw that away.

To clarify though, the structure you seem to want is something like this:

{
    "_id": "sdf3f",
    "likedBy": [
         { "userId": "12ac",  "date": ISODate("2014-04-09T07:30:47.091Z") },
         { "userId": "as3vd", "date": ISODate("2014-04-09T07:30:47.091Z") },
         { "userId": "sadf3", "date": ISODate("2014-04-09T07:30:47.091Z") }
    ],
    "dislikedBy": [
        { "userId": "asdf",   "date": ISODate("2014-04-09T07:30:47.091Z") },
        { "userId": "sdsdf",  "date": ISODate("2014-04-09T07:30:47.091Z") },
        { "userId": "asdfas", "date": ISODate("2014-04-09T07:30:47.091Z") }
    ],
    "active": true
}

Which is all well and fine, except that there is one catch. Because this content sits in two array fields, you would not be able to create a single compound index over both of them. The restriction is that only one array-valued (multikey) field can be included in a compound index.

So to solve the obvious problem of your first query not being able to use an index, you would structure it like this instead:

{
    "_id": "sdf3f",
    "votes": [
        { 
            "userId": "12ac",
            "type": "like", 
            "date": ISODate("2014-04-09T07:30:47.091Z")
        },
        {
            "userId": "as3vd",
            "type": "like",
            "date": ISODate("2014-04-09T07:30:47.091Z")
        },
        { 
            "userId": "sadf3", 
            "type": "like", 
            "date": ISODate("2014-04-09T07:30:47.091Z")
        },
        { 
            "userId": "asdf", 
            "type": "dislike",
            "date": ISODate("2014-04-09T07:30:47.091Z")
        },
        {
            "userId": "sdsdf",
            "type": "dislike", 
            "date": ISODate("2014-04-09T07:30:47.091Z")
        },
        { 
            "userId": "asdfas", 
            "type": "dislike",
            "date": ISODate("2014-04-09T07:30:47.091Z")
        }
    ],
    "active": true
}

This allows an index that covers this form:

db.post.createIndex({
    "active": 1,
    "votes.userId": 1, 
    "votes.date": 1, 
    "votes.type": 1 
})

Actually you will probably want a few indexes to suit your usage patterns, but the point is that you can now have indexes you can use.

Covering the first case you have this form of query:

db.post.find({ "active": true, "votes.userId": { "$ne": "12ac" } })

That makes sense considering that you clearly are not going to have both a like and a dislike from the same user on one listing. Given the order of that index, at least active can be used to filter; the negated condition has to scan everything else. There is no way around that with any structure.

For the other case you will probably want an index with userId as the first element, ahead of date. Then your query is quite simple:
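To make the matching rule of that first query concrete, here is a minimal in-memory sketch in plain JavaScript rather than a shell query. It models how `$ne` behaves on an array path: a document matches only when no element of the array has the given value. The function name `unseenListings` and the sample data are assumptions made up for illustration.

```javascript
// In-memory model of:
//   find({ "active": true, "votes.userId": { "$ne": userId } })
// `unseenListings` and the sample data are hypothetical names/values.
function unseenListings(listings, userId) {
  return listings.filter(
    (listing) =>
      listing.active === true &&
      // $ne on an array path matches only when NO element equals the value
      !(listing.votes || []).some((vote) => vote.userId === userId)
  );
}

const sampleListings = [
  { _id: "a", active: true, votes: [{ userId: "12ac", type: "like" }] },
  { _id: "b", active: true, votes: [{ userId: "asdf", type: "dislike" }] },
  { _id: "c", active: false, votes: [] },
  { _id: "d", active: true, votes: [] },
];

const unseenIds = unseenListings(sampleListings, "12ac").map((l) => l._id);
console.log(unseenIds); // listings "b" and "d"
```

Listing "a" is excluded because the user already voted on it, and "c" because it is inactive.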

db.post.find({ "votes.userId": "12ac" })
    .sort({ "votes.userId": 1, "votes.date": 1 })
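One caveat worth checking on your own data: when sorting on an array path, MongoDB orders each document by the smallest (ascending) or largest (descending) value in the array, so the sort above may reflect other users' vote dates rather than this user's. A per-user history is closer to what an `$unwind`, `$match`, `$sort` pipeline would compute. The plain JavaScript sketch below models that result in memory; `voteHistory` and the sample data are hypothetical.

```javascript
// In-memory model of "unwind the votes, keep this user's, sort by date".
// `voteHistory` and the sample data are hypothetical names/values.
function voteHistory(listings, userId) {
  return listings
    .flatMap((listing) =>
      (listing.votes || [])
        .filter((vote) => vote.userId === userId)
        .map((vote) => ({
          listingId: listing._id,
          type: vote.type,
          date: vote.date,
        }))
    )
    // Date objects subtract as millisecond timestamps
    .sort((a, b) => a.date - b.date);
}

const historySample = [
  {
    _id: "x",
    votes: [
      { userId: "12ac", type: "like", date: new Date("2014-04-09T07:30:47Z") },
      { userId: "asdf", type: "dislike", date: new Date("2014-04-01T00:00:00Z") },
    ],
  },
  {
    _id: "y",
    votes: [
      { userId: "12ac", type: "dislike", date: new Date("2014-04-05T00:00:00Z") },
    ],
  },
];

const history = voteHistory(historySample, "12ac");
console.log(history.map((h) => h.listingId)); // "y" first, then "x"
```

Note that listing "x" holds the earliest date overall (from another user), yet it comes second here because only the votes of "12ac" are compared.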

But you may be wondering whether you have lost something: getting the count of "likes" and "dislikes" used to be as easy as testing the size of the array, but now it's a little different. It is not a problem that cannot be solved using aggregate:

db.post.aggregate([
    { "$unwind": "$votes" },
    { "$group": {
        "_id": {
            "_id": "$_id",
            "active": "$active"
        },
        "likes": { "$sum": { "$cond": [
            { "$eq": [ "$votes.type", "like" ] },
            1,
            0
        ]}},
        "dislikes": { "$sum": { "$cond": [
            { "$eq": [ "$votes.type", "dislike" ] },
            1,
            0
        ]}}
    }}
])
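For readers who want to see what that grouping produces, here is a small in-memory equivalent in plain JavaScript. It is only a sketch of the counting logic, not the pipeline itself; `countVotes` and the sample document are hypothetical.

```javascript
// In-memory model of the conditional sums in the $group stage.
// `countVotes` and the sample document are hypothetical names/values.
function countVotes(listing) {
  const counts = {
    _id: listing._id,
    active: listing.active,
    likes: 0,
    dislikes: 0,
  };
  for (const vote of listing.votes || []) {
    if (vote.type === "like") counts.likes += 1;
    else if (vote.type === "dislike") counts.dislikes += 1;
  }
  return counts;
}

const countSample = {
  _id: "sdf3f",
  active: true,
  votes: [
    { userId: "12ac", type: "like" },
    { userId: "as3vd", type: "like" },
    { userId: "asdf", type: "dislike" },
  ],
};

const voteCounts = countVotes(countSample);
console.log(voteCounts); // two likes and one dislike for listing "sdf3f"
```

One difference to keep in mind: a plain `$unwind` drops documents whose array is empty, so a listing with no votes would not appear in the pipeline output at all, whereas this sketch returns zero counts for it.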

So whatever your actual usage, you can keep any important parts of the document in the grouping _id and then count the "likes" and "dislikes" easily.

You may also note that changing an entry from like to dislike can be done in a single atomic update.

There is much more you can do, but I would prefer this structure for the reasons given.
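As a sketch of what such an update could look like, the helper below builds the filter and update documents using the positional `$` operator, which targets the array element matched by the query. The helper name `buildVoteChange` is hypothetical, and refreshing the date along with the type is an assumption about what you would want.

```javascript
// Builds the arguments for a single-document update that flips one user's vote.
// `buildVoteChange` is a hypothetical helper; updating "date" too is an assumption.
function buildVoteChange(listingId, userId, newType, now = new Date()) {
  return {
    // Matching on "votes.userId" lets the positional $ point at that element
    filter: { _id: listingId, "votes.userId": userId },
    update: {
      $set: { "votes.$.type": newType, "votes.$.date": now },
    },
  };
}

const change = buildVoteChange(
  "sdf3f",
  "12ac",
  "dislike",
  new Date("2014-04-10T00:00:00Z")
);

// In the shell this would be passed along as:
//   db.post.updateOne(change.filter, change.update)
console.log(JSON.stringify(change, null, 2));
```

Because the change touches a single document, the type and date are modified together or not at all.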

answered Sep 22 '22 11:09 by Neil Lunn