
Getting the difference of two differently structured collections

Suppose I have two collections, A and B.

A contains simple documents of the following form:

{ _id: '...', value: 'A', data: '...' }
{ _id: '...', value: 'B', data: '...' }
{ _id: '...', value: 'C', data: '...' }
…

B contains nested objects like this:

{ _id: '...', values: [ 'A', 'B' ]}
{ _id: '...', values: [ 'C' ]}
…

It can happen that some documents in A are not referenced by any document in B, or that documents in B reference values that do not exist in A.

Let's call them "orphaned".

My question now is: How do I find those orphaned documents as efficiently as possible? In the end, what I need is their _id field.

So far I have used unwind to "flatten" B and calculated the difference using Ramda's differenceWith function, but this takes quite a long time and is certainly not efficient, since I do all the work on the client instead of in the database.

I have seen that there is a $setDifference operator in MongoDB, but I did not get it to work.

Can anyone point me in the right direction on how to solve this issue using Node.js while running most (all?) of the work in the database? Any hints are appreciated :-)

Golo Roden asked Jun 25 '15 09:06


1 Answer

In MongoDB, you can use the aggregation pipeline for this. If that doesn't help, you can use MapReduce, but it is a bit more complicated.

For this example I named the two collections "Tags" and "Papers", where Tags corresponds to "B" in your example and Papers to "A".

First we get the set of values that are actually referenced. For this, we unwind the 'values' array of each document in the tags collection and group the results back together. Unwinding creates one document per element of the 'values' array, each carrying the original _id. The group stage then collects this flat list into a single array and discards the ids.

 // aggregate() returns a cursor; next() fetches the single grouped document.
 var referenced_tags = db.tags.aggregate([
     {$unwind: '$values'},
     {$group: {
         _id: '',
         tags: {$push: '$values'}
     }}
 ]).next();

This returns:

{ "_id" : "", "tags" : [ "A", "B", "C"] }

This list is a collection of all values in all documents.
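
To make the two stages more concrete, here is a rough plain-JavaScript simulation of what $unwind and $group with $push do to the sample data. It is only an illustration that runs in Node.js without a database; the _id values 't1' and 't2' and the helper names unwind and groupPush are made up for this sketch.

```javascript
// In-memory stand-in for the tags collection (the _id values are made up).
const tags = [
    { _id: 't1', values: ['A', 'B'] },
    { _id: 't2', values: ['C'] }
];

// Mimics $unwind: one output document per array element, keeping the original _id.
function unwind(docs, field) {
    return docs.flatMap(doc =>
        doc[field].map(v => Object.assign({}, doc, { [field]: v })));
}

// Mimics $group with _id: '' and $push: collects one field of every document into one array.
function groupPush(docs, field) {
    return { _id: '', tags: docs.map(doc => doc[field]) };
}

const unwound = unwind(tags, 'values');
const referencedTags = groupPush(unwound, 'values');

console.log(unwound);        // three documents, one per value
console.log(referencedTags); // a single document holding all values
```

Because $push keeps duplicates, a value referenced by several tags documents would appear several times in the result.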

Then you build a similar list containing the values of the documents that actually exist. This doesn't need the unwind step, since 'value' is a scalar (not a list).

// As above, next() fetches the single grouped document from the cursor.
var papers = db.papers.aggregate([
    {$group: {
        _id: '',
        tags: {$push: '$value'}
    }}
]).next();

yielding

{ "_id" : "", "tags" : [ "A", "B", "C", "D"] }

As you can see from the data I put in the database, there is a document (paper) in A with the value "D" that is not referenced in the tags collection and is therefore an orphan.

You can now compute the difference set in any way you like. The following might be slow, but it is suitable as an example:

var a = papers.tags;          // values that exist in papers
var b = referenced_tags.tags; // values referenced from tags
var delta = a.filter(function (v) { return b.indexOf(v) < 0; });
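
The indexOf lookup scans the second array once for every element of the first. A Set-based variant avoids that; this is only a sketch, with a hypothetical helper named difference and made-up sample values ('E' is not part of the data above). Calling it in both directions gives both kinds of orphans.

```javascript
// Returns the values in `left` that do not occur in `right`.
// Building a Set once makes each lookup constant time on average.
function difference(left, right) {
    const seen = new Set(right);
    return left.filter(v => !seen.has(v));
}

const existing = ['A', 'B', 'C', 'D'];   // values found in papers
const referenced = ['A', 'B', 'C', 'E']; // values referenced from tags ('E' is made up)

const unreferenced = difference(existing, referenced); // papers that no tag points to
const dangling = difference(referenced, existing);     // references to papers that do not exist

console.log(unreferenced, dangling);
```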

As a next step, you can find the ids by querying for the values in delta and projecting only the _id field:

db.papers.find({'value' : {'$in': delta}}, {'_id': 1})

Returning:

{ "_id" : ObjectId("558bd2...44f6a") }

EDIT: While this nicely shows how to approach the problem with the aggregation framework, it is not a feasible solution. You don't even need aggregation, since MongoDB is quite smart:

db.papers.find({'value' : {'$nin': tags.values }}, {'_id': 1})

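Where tags is the first document returned from the tags collection:

// next() yields only the first tags document; its values array is what $nin compares against.
var cursor = db.tags.find();
var tags = cursor.hasNext() ? cursor.next() : null;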

As pointed out by @karthick.k
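
If the tags collection holds more than one document, the referenced values have to be gathered from all of them first. Here is a minimal sketch for the mongo shell, assuming the collection names used in this answer. findOrphanedPaperIds is a hypothetical helper name, and since distinct() returns its result as one array, the set of values is assumed to fit within the BSON document size limit.

```javascript
// Sketch for the mongo shell; `db` is assumed to be the shell's database handle.
// findOrphanedPaperIds is a made-up helper name, not part of any MongoDB API.
function findOrphanedPaperIds(db) {
    // distinct() on an array field returns each distinct element across all documents.
    var referenced = db.tags.distinct('values');

    // $nin keeps the papers whose value is not referenced by any tags document;
    // the projection returns only the _id field.
    return db.papers
        .find({ value: { $nin: referenced } }, { _id: 1 })
        .toArray()
        .map(function (doc) { return doc._id; });
}
```

In the shell this would be called as findOrphanedPaperIds(db). With the Node.js driver the same calls exist but return promises, so the function would have to await them.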

cessor answered Oct 11 '22 23:10