Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding two documents in MongoDB that share a key value

I have a large collection of documents in MongoDB, each one of those documents has a key called "name", and another key called "type". I would like to find two documents with the same name and different types, a simple MongoDB counterpart of

SELECT ...
FROM table AS t1, table AS t2
WHERE t1.name = t2.name AND t1.type <> t2.type

I can imagine that one can do this using aggregation: however, the collection is very large, processing it will take time and I'm looking just for one pair of such documents.

like image 697
Alexander Serebrenik Avatar asked Mar 01 '14 13:03

Alexander Serebrenik


People also ask

How do I search multiple documents in MongoDB?

You can query for multiple documents in a collection with collection. find() . The find() method uses a query document that you provide to match the subset of the documents in the collection that match the query.

How do I reference other documents in MongoDB?

MongoDB applications use one of two methods to relate documents: Manual references save the _id field of one document in another document as a reference. Your application runs a second query to return the related data. These references are simple and sufficient for most use cases.

How does MongoDB store key-value pairs?

Every key-value pair is stored in a bucket, which is really just a MongoDB collection (the "bucket" terminology is used merely for resemblance with other key-value stores), so the same key can exist, with possibly different values, in multiple buckets.

How do you get data from two collections in MongoDB?

For performing MongoDB Join two collections, you must use the $lookup operator. It is defined as a stage that executes a left outer join with another collection and aids in filtering data from joined documents. For example, if a user requires all grades from all students, then the below query can be written: Students.


2 Answers

While I stand by by comments that I don't think the way you are phrasing your question is actually related to a specific problem you have, I will go someway to explain the idiomatic SQL way in a MongoDB type of solution. I stand on that your actual solution would be different but you haven't presented us with that problem, but only SQL.

So consider the following documents as a sample set, removing _id fields in this listing for clarity:

{ "name" : "a", "type" : "b" }
{ "name" : "a", "type" : "c" }
{ "name" : "b", "type" : "c" }
{ "name" : "b", "type" : "a" }
{ "name" : "a", "type" : "b" }
{ "name" : "b", "type" : "c" }
{ "name" : "f", "type" : "e" }
{ "name" : "z", "type" : "z" }
{ "name" : "z", "type" : "z" }

If we ran the SQL presented over the same data we would get this result:

a|b
a|c
a|c
b|c
b|a
b|a
a|b
b|c

We can see that 2 documents do not match, and then work out the logic of the SQL operation. So the other way of saying it is "Which documents given a key of "name" do have more than one possible value in the key "type".

Given that, taking a mongo approach, we can query for the items that do not match the given condition. So effectively the reverse of the result:

db.sample.aggregate([

    // Store unique documents grouped by the "name"
    {$group: { 
        _id: "$name",
        comp: {
            $addToSet: { 
                name:"$name",
                type: "$type" 
            }
        } 
    }},

    // Unwind the "set" results
    {$unwind: "$comp"},

    // Push the results back to get the unique count
    // *note* you could not have done this with alongside $addtoSet
    {$group: {
        _id: "$_id",
        comp: {
            $push: { 
                name: "$comp.name",
                type: "$comp.type" 
            }
        },
        count: {$sum: 1} 
    }},

    // Match only what was counted once
    {$match: {count: 1}},

    // Unwind the array
    {$unwind: "$comp"},

    // Clean up to "name" and "type" only
    {$project: { _id: 0, name: "$comp.name", type: "$comp.type"}}

])

This operation will yield the results:

{ "name" : "f", "type" : "e" }
{ "name" : "z", "type" : "z" }

Now in order to get the same result as the SQL query we would take those results and channel them into another query:

db.sample.find({$nor: [{ name: "f", type: "e"},{ name: "z", type: "z"}] })

Which arrives as the final matching result:

{ "name" : "a", "type" : "b" }
{ "name" : "a", "type" : "c" }
{ "name" : "b", "type" : "c" }
{ "name" : "b", "type" : "a" }
{ "name" : "a", "type" : "b" }
{ "name" : "b", "type" : "c" }

So this will work, however the one thing that may make this impractical is where the number of documents being compared is very large, we hit a working limit on compacting those results down to an array.

It also suffers a bit from the use of a negative in the final find operation which would force a scan of the collection. But in all fairness the same could be said of the SQL query that uses the same negative premise.

Edit

Of course what I did not mention is that if the result set goes the other way around and you are matching more results in the excluded items from the aggregate, then just reverse the logic to get the keys that you want. Simply change $match as follows:

{$match: {$gt: 1}}

And that will be the result, maybe not the actual documents but it is a result. So you don't need another query to match the negative cases.

And, ultimately this was my fault because I was so focused on the idiomatic translation that I did not read the last line in your question, where to do say that you were looking for one document.

Of course, currently if that result size is larger than 16MB then you are stuck. At least until the 2.6 release, where the results of aggregation operations are a cursor, so you can iterate that like a .find().

Also introduced in 2.6 is the $size operator which is used to find the size of an array in the document. So this would help to remove the second $unwind and $group that are used in order to get the length of the set. This alters the query to a faster form:

db.sample.aggregate([
    {$group: { 
        _id: "$name",
        comp: {
            $addToSet: { 
                name:"$name",
                type: "$type"
            }
        } 
    }},
    {$project: { 
        comp: 1,
        count: {$size: "$comp"} 
    }},
    {$match: {count: {$gt: 1}}},
    {$unwind: "$comp"},
    {$project: { _id: 0, name: "$comp.name", type: "$comp.type"}}
])

And MongoDB 2.6.0-rc0 is currently available if you are doing this just for personal use, or development/testing.


Moral of the story. Yes you can do it, But do you really want or need to do it that way? Then probably not, and if you asked a different question about the specific business case, you may get a different answer. But then again this may be exactly right for what you want.

Note

Worthwhile to mention that when you look at the results from the SQL, it will erroneously duplicate several items due to the other available type options if you didn't use a DISTINCT for those values or essentially another grouping. But that is the result that was being produced by this process using MongoDB.

For Alexander

This is the output of the aggregate in the shell from current 2.4.x versions:

{
    "result" : [
            {
                    "name" : "f",
                    "type" : "e"
            },
            {
                    "name" : "z",
                    "type" : "z"
            }
    ],
    "ok" : 1
}

So do this to get a var to pass as the argument to the $nor condition in the second find, like this:

var cond = db.sample.aggregate([ .....

db.sample.find({$nor: cond.result })

And you should get the same results. Otherwise consult your driver.

like image 97
Neil Lunn Avatar answered Oct 13 '22 07:10

Neil Lunn


There is a very simple aggregation that works to get you the names and their types that occur more than once:

db.collection.aggregate([
      { $group: { _id : "$name", 
        count:{$sum:1},
        types:{$addToSet:"$type"}}},
      {$match:{"types.1":{$exists:true}}}
])

This works in all versions that support aggregation framework.

like image 28
Asya Kamsky Avatar answered Oct 13 '22 07:10

Asya Kamsky