
Identifying Duplicates in CouchDB

I'm new to CouchDB and document-oriented databases in general.

I've been playing around with CouchDB and have gotten familiar with creating documents (with Perl) and using the Map/Reduce functions in Futon to query the data and create views.

One of the things I'm still trying to figure out is how to identify duplicate values across documents using Futon's Map/Reduce.

For example, if I have the following documents:

{
  "_id": "123",
  "name": "carl",
  "timestamp": "2012-01-27T17:06:03Z"
}

{
  "_id": "124",
  "name": "carl",
  "timestamp": "2012-01-27T17:07:03Z"
}

If I wanted a list of the document IDs that share a duplicate "name" value, is that something I could do with Futon's Map/Reduce?

The result I was hoping to achieve is as follows:

{
  "name": "carl",
  "dupes": [ "123", "124" ]
}

..or..

{
  "carl": [ "123", "124" ]
}

.. that is, each duplicated value together with the IDs of the documents that contain it.

I've tried a few different things with Map/Reduce, but so far as I understand, the Map function works with data on a per-document basis, and the Reduce functions only allow you to work with the keys/values from a given document.

I know I could just pull the data I need with Perl, work magic there, and get the result I want, but I'm trying to work only with CouchDB for now in order to better understand its benefits and limitations.

Another way I'm thinking about doing this is to use a single document like an RDBMS table:

{
  "_id": "names",
  "rec1": {
    "_id": "123",
    "name": "carl",
    "timestamp": "2012-01-27T17:06:03Z"
  },
  "rec2": {
    "_id": "124",
    "name": "carl",
    "timestamp": "2012-01-27T17:07:03Z"
  }
}

.. which should allow me to use the Map/Reduce functions in the way I originally thought. However, I'm not sure this is ideal.

I understand that my mind is still stuck in RDBMS land, so much of what I'm trying to do above may not be necessary. Any insight on this would be much appreciated.

Thanks!

Edit: Fixed JSON syntax in some of the examples.

asked Jan 27 '12 at 18:01 by jbobbylopez


1 Answer

If you merely want a list of unique values, that's pretty easy. If you wish to identify the duplicates, then it gets less easy.

In both cases, a map function like this should suffice:

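function (doc) {
   // Skip documents without a name so they are not counted under a null key.
   if (doc.name) {
      // The value is unused; only the key matters for counting.
      emit(doc.name, null);
   }
}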

For your reduce function, just enter _count.

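Queried with group=true, the view output for your two documents will look like this (without grouping, the reduce collapses everything into a single row with a null key):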

{
    "rows": [
        { "key": "carl", "value": 2 }
    ]
}
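To make the mechanics concrete, here is a rough simulation in plain JavaScript of what the view does with the two example documents when rows are grouped by key. It is only a sketch that runs outside CouchDB: `runView`, `countByKey`, and the `emit` collector passed into the map function are hypothetical helpers for illustration, not CouchDB APIs.

```javascript
// Hypothetical stand-in for the view engine; not a CouchDB API.
function runView(docs, mapFn) {
    var rows = [];
    docs.forEach(function (doc) {
        // Collect every emit() call the map function makes for this document.
        mapFn(doc, function emit(key, value) {
            rows.push({ id: doc._id, key: key, value: value === undefined ? null : value });
        });
    });
    // Views are sorted by key; plain string comparison is enough for this sketch.
    rows.sort(function (a, b) { return a.key < b.key ? -1 : a.key > b.key ? 1 : 0; });
    return rows;
}

// Mimics the built-in _count reduce with grouping: one row per distinct key.
function countByKey(rows) {
    var grouped = [];
    rows.forEach(function (row) {
        var last = grouped[grouped.length - 1];
        if (last && last.key === row.key) {
            last.value += 1;
        } else {
            grouped.push({ key: row.key, value: 1 });
        }
    });
    return grouped;
}

var docs = [
    { _id: "123", name: "carl", timestamp: "2012-01-27T17:06:03Z" },
    { _id: "124", name: "carl", timestamp: "2012-01-27T17:07:03Z" }
];

// Same map logic as above, with emit passed in so it runs outside CouchDB.
var mapRows = runView(docs, function (doc, emit) {
    emit(doc.name);
});

var counted = countByKey(mapRows);
console.log(JSON.stringify(counted)); // [{"key":"carl","value":2}]
```

The map step runs once per document, and it is the grouping of identical keys in the reduce step that lets values from different documents meet.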

From there, you will have a list of names as well as their frequency. You can take that list and filter it yourself, or you can take the "all couch" route and use a _list function to perform that final filtering.

function (head, req) {
    var row, duplicates = [];
    // Rows come from the reduced view (query with group=true); value is the count.
    while ((row = getRow())) {
        if (row.value > 1) {
            duplicates.push(row);
        }
    }
    send(JSON.stringify(duplicates));
}
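If you would rather go the "filter it yourself" route, the same check can be done on the client once the grouped view response has been parsed. A minimal sketch, assuming the response has the shape shown earlier; `findDuplicateKeys` is a hypothetical helper name and the "dana" row is made up for contrast.

```javascript
// Keeps only the keys whose count shows they occur in more than one document.
// `viewResponse` is the parsed JSON body returned by the grouped view.
function findDuplicateKeys(viewResponse) {
    return viewResponse.rows
        .filter(function (row) { return row.value > 1; })
        .map(function (row) { return row.key; });
}

var response = {
    rows: [
        { key: "carl", value: 2 },
        { key: "dana", value: 1 } // hypothetical extra row, not in the question
    ]
};

var duplicateKeys = findDuplicateKeys(response);
console.log(duplicateKeys); // [ 'carl' ]
```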

Read up on _list functions; they're pretty handy and versatile.
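The list function above returns names and counts, not the document IDs the question asks for. One possible way to get those, sketched under the assumption that the same view is run through a list with reduce=false (so each row carries the `id` of the emitting document, and rows arrive sorted by key), is to group consecutive rows. The `runList` harness and its mocked `getRow`/`send` are hypothetical and exist only so the sketch runs outside CouchDB; only `dupesList` would go into a design document.

```javascript
// Hypothetical stand-ins for the globals CouchDB provides to list functions.
var getRow, send;

// List function sketch: expects map rows (reduce=false), sorted by key.
function dupesList(head, req) {
    var row, current = null, result = {};
    while ((row = getRow())) {
        // A change of key starts a new group.
        if (current === null || current.key !== row.key) {
            current = { key: row.key, ids: [] };
        }
        current.ids.push(row.id);
        // Record the key once a second document shares it.
        if (current.ids.length > 1) {
            result[current.key] = current.ids;
        }
    }
    send(JSON.stringify(result));
}

// Test harness only: feeds rows to the list function and captures its output.
function runList(listFn, rows) {
    var index = 0, output = "";
    getRow = function () { return index < rows.length ? rows[index++] : null; };
    send = function (chunk) { output += chunk; };
    listFn({}, {});
    return output;
}

var output = runList(dupesList, [
    { id: "123", key: "carl", value: null },
    { id: "124", key: "carl", value: null },
    { id: "125", key: "dana", value: null } // hypothetical non-duplicate
]);
console.log(output); // {"carl":["123","124"]}
```

This sketch assumes string keys, since they become property names in the result object, and it holds all duplicate groups in memory before sending.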

answered Oct 05 '22 at 23:10 by Dominic Barnes