MongoDB aggregation framework: match by nested documents

I have the following list of documents:

{
    "_id" : "Tvq579754r",
    "name": "Tom",
    "forms": {
           "PreOp":{
             "status":"closed"          
           },

           "Alert":{
             "status":"closed"          
           },

           "City":{
              "status":"closed"         
           },

          "Country":{
             "status":"closed"          
          } 
    }
},
....
{
    "_id" : "Tvq444454j",
    "name": "Jim",
    "forms": {
          "Jorney":{
             "status":"closed"          
           },

          "Women":{
             "status":"void"            
          },

         "Child":{
            "status":"closed"           
         },

         "Farm":{
           "status":"closed"            
         }  
     }
}

I want to filter them by their 'status' field ('forms.name_of_form.status'). I need to fetch all documents that don't have any 'forms.name_of_form.status' equal to 'void'.

The expected result is the document without any voided form status:

{
    "_id" : "Tvq579754r",
    "name": "Tom",
    "forms": {
           "PreOp":{
             "status":"closed"          
           },

           "Alert":{
             "status":"closed"          
           },

           "City":{
              "status":"closed"         
           },

          "Country":{
             "status":"closed"          
          } 
    }
}
Andrii Furmanets asked Feb 07 '14 09:02


1 Answer

Querying this structure for the results you want is not possible without knowing all of the possible form names beforehand and using them in the query. It would be very messy at any rate. That said, read on as I explain how it can be done.

There is a problem with the structure of these documents that is going to prevent you from doing any reasonable query analysis. As it stands, you would have to know all the possible form name fields in order to filter out anything.

Your current structure has forms containing a sub-document, of which each key contains another sub-document with a single property, status. This is difficult to traverse as your forms element has an arbitrary structure for each document you create. That means the pattern to descend to the status information you want to compare changes for every document in your collection.

Here is what I mean by path. To get at status in any element you have to follow a path like one of these:

Forms -> PreOp -> status

Forms -> Alert -> status

The second element changes every time, and there is no way to wildcard it because the field names in a query path have to be explicit.
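To make the "you have to know every form name" point concrete, here is a minimal sketch in plain JavaScript. The helper name buildNotVoidFilter and the knownForms list are hypothetical; the list simply repeats the form names from the two sample documents. It builds one dot-notation condition per known form, which is exactly the part that does not scale.

```javascript
// Hypothetical helper: builds a find() filter that excludes documents in which
// any of the *known* form names has status "void".
// The caller must supply the names; the query itself cannot discover them.
function buildNotVoidFilter(formNames) {
  const filter = {};
  for (const name of formNames) {
    // Dot notation reaches into the nested sub-document, e.g. "forms.PreOp.status".
    // $ne also matches documents that do not have the field at all.
    filter["forms." + name + ".status"] = { $ne: "void" };
  }
  return filter;
}

// Form names copied from the two sample documents in the question.
const knownForms = ["PreOp", "Alert", "City", "Country", "Jorney", "Women", "Child", "Farm"];
const filter = buildNotVoidFilter(knownForms);

// In the mongo shell the result would be passed as: db.forms.find(filter)
console.log(JSON.stringify(filter, null, 2));
```

Any form name missing from the list is silently ignored by the filter, which is why this approach is fragile.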

This may have been considered an easy way to implement serializing the data from your forms but I see a more flexible alternative. What you need is a document structure you can traverse in a standard pattern. This is always something worth considering in design. Take the following:

{
    "_id" : "Tvq444454j",
    "name": "Jim",
    "forms": [
        {
             "name": "Jorney",
             "status":"closed"          
        },
        {
            "name": "Women",
            "status":"void"            
        },
        {
            "name": "Child",
            "status":"closed"           
        },
        {
            "name": "Farm",
            "status":"closed"            
        }  
    ]
}

So the structure of the document is changed to make the forms element an Array, and rather than placing the status field under a key that names the "form field", each member of the Array is a sub-document containing the "form field" name and the status. Both the identifier and the status are still paired together, just represented as a sub-document now. Most importantly, this changes the access path to these keys, as now for both the field name and its status we can do

Forms -> status

or

Forms -> name

What this means is that you can query to find the names of all the fields in the form or all the status fields in the form, or even all the documents with a certain name field and certain status. That is much better than what could be done with the original structure.
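If you already have data in the original shape, the conversion to the array form is mechanical. The following is a rough sketch in plain JavaScript; formsToArray is a hypothetical helper name, and it assumes every value under forms is an object like { status: ... } as in the question.

```javascript
// Hypothetical helper: turns forms: { FormName: { status } }
// into forms: [ { name: FormName, status } ] without mutating the input.
function formsToArray(doc) {
  const forms = Object.entries(doc.forms).map(([name, fields]) => ({ name, ...fields }));
  return { ...doc, forms };
}

// Sample document copied from the question (original structure).
const jim = {
  _id: "Tvq444454j",
  name: "Jim",
  forms: {
    Jorney: { status: "closed" },
    Women: { status: "void" },
    Child: { status: "closed" },
    Farm: { status: "closed" }
  }
};

const converted = formsToArray(jim);
console.log(JSON.stringify(converted, null, 2));
```

The converted document has the same shape as the revised structure shown above, so paths such as "forms.status" and "forms.name" become usable in queries.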

Now in your particular case, you want to get only the documents where none of the fields is void. At first glance there is no operator that compares all the elements in an array and checks that they are the same (though see the EDIT at the end for a single-query form). There are two approaches you can take:

The first, and probably the less efficient, is to query for all documents that contain an element in forms with a status of "void". With the resulting _id values you can issue a second query that returns the documents whose _id is not in that list.

db.forms.find({ "forms.status": "void" },{ _id: 1})

db.forms.find({ _id: { $not: { $in: [<Object1>,<Object2>,<Object3>,... ] } } })

Depending on the size of the result this may not be possible, and it is generally not a good idea: the exclusion operator $not basically forces a full scan of the collection, so you couldn't use an index.

Another approach is to use the aggregation pipeline as follows:

db.forms.aggregate([
    { "$unwind": "$forms" },
    { "$group": { "_id": "$_id", "status": { "$addToSet": "$forms.status" }}},
    { "$unwind": "$status" },
    { "$sort": { "_id": 1, "status": -1 }},
    { "$group": { "_id": "$_id", "status": { "$first": "$status"}}},
    { "$match":{ "status": "closed" }}
])

Of course that will only return the _id for the documents that match, but you can issue a query with $in and return the whole matching documents. This is better than the exclusion operator used before, and now we can use an index to avoid full collection scans.
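The pipeline works because "void" sorts after "closed", so a descending sort puts "void" first whenever it is present. The sketch below traces the same logic over an in-memory array in plain JavaScript. It is an illustration only, not MongoDB code; idsWithAllClosed and sampleDocs are hypothetical names, and it assumes the only status values are "closed" and "void" as in the question.

```javascript
// In-memory illustration of what the pipeline stages compute for each document.
function idsWithAllClosed(docs) {
  const ids = [];
  for (const doc of docs) {
    // $unwind + $group with $addToSet: the distinct status values for this _id.
    const distinct = [...new Set(doc.forms.map((f) => f.status))];
    // $unwind + $sort with status descending: "void" sorts ahead of "closed".
    distinct.sort().reverse();
    // $group with $first + $match: keep the _id only if the top value is "closed".
    if (distinct[0] === "closed") ids.push(doc._id);
  }
  return ids;
}

// Sample data in the revised (array) structure, based on the question's documents.
const sampleDocs = [
  { _id: "Tvq579754r", name: "Tom", forms: [
    { name: "PreOp", status: "closed" }, { name: "Alert", status: "closed" },
    { name: "City", status: "closed" }, { name: "Country", status: "closed" }
  ] },
  { _id: "Tvq444454j", name: "Jim", forms: [
    { name: "Jorney", status: "closed" }, { name: "Women", status: "void" },
    { name: "Child", status: "closed" }, { name: "Farm", status: "closed" }
  ] }
];

const matchingIds = idsWithAllClosed(sampleDocs);
console.log(matchingIds);
```

With the sample data only Tom's _id remains. If other status values existed that sort after "closed", the final match would have to be adjusted.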

As a final approach, and for the best performance, you could change the document again so that at the top level you keep the "status" of whether any field in the forms is "void" or "closed". So at the top level the value would be "closed" only if all the items were "closed", "void" if something were void, and so on.

That final one would mean a further programmatic change, and all changes to the forms items would need to update this field as well to maintain the "status". It is however the most efficient way of finding the documents you need and may be worth consideration.
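As a rough sketch of that programmatic change, the application could recompute the top-level field every time a form item changes. The helper names masterStatus and setFormStatus are hypothetical, the top-level field is assumed to be called status, and only the two values "closed" and "void" are handled.

```javascript
// Hypothetical helper: derives the top-level status from the forms array.
// "void" if any item is void, otherwise "closed".
function masterStatus(forms) {
  return forms.some((f) => f.status === "void") ? "void" : "closed";
}

// Hypothetical helper: changes one form's status and refreshes the
// top-level status in the same step, returning a new document.
function setFormStatus(doc, formName, status) {
  const forms = doc.forms.map((f) => (f.name === formName ? { ...f, status } : f));
  return { ...doc, forms, status: masterStatus(forms) };
}

const before = {
  _id: "Tvq444454j",
  name: "Jim",
  status: "closed",
  forms: [
    { name: "Jorney", status: "closed" },
    { name: "Women", status: "closed" }
  ]
};

const after = setFormStatus(before, "Women", "void");
console.log(after.status);
```

With such a field in place, the documents you need could be found with a plain equality match on the top-level field, for example db.forms.find({ status: "closed" }), which can be served by an ordinary index on that field.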


EDIT:

Aside from changing the document to have a master status, the fastest query form on the revised structure is actually:

db.forms.find({ "forms": { "$not": { "$elemMatch": { "status": "void" } } } })
Neil Lunn answered Sep 22 '22 23:09