Identifying Duplicates in CouchDB

Question

I'm new to CouchDB and document-oriented databases in general.

I've been playing around with CouchDB and have become familiar with creating documents (with Perl) and using the Map/Reduce functions in Futon to query the data and create views.

One of the things I'm still trying to figure out is how to identify duplicate values across documents using Futon's Map/Reduce.

For example, if I have the following documents:

{
  "_id": "123",
  "name": "carl",
  "timestamp": "2012-01-27T17:06:03Z"
}

{
  "_id": "124",
  "name": "carl",
  "timestamp": "2012-01-27T17:07:03Z"
}

Suppose I wanted a list of the document IDs that share a duplicate "name" value. Is this something I could do with Futon's Map/Reduce?

The result I was hoping to achieve is as follows:

{
  "name": "carl",
  "dupes": [ "123", "124" ]
}

..or..

{
  "carl": [ "123", "124" ]
}

.. that is, the duplicated value together with the IDs of the documents that contain it.

I've tried a few different things with Map/Reduce, but so far as I understand, the Map function works with data on a per-document basis, and the Reduce functions only allow you to work with the keys/values from a given document.

I know I could just pull the data I need with Perl, work magic there, and get the result I want, but I'm trying to work only with CouchDB for now in order to better understand its benefits and limitations.

Another way I'm thinking about doing this is to use a single document like an RDBMS table:

{
  "_id": "names",
  "rec1": {
    "_id": "123",
    "name": "carl",
    "timestamp": "2012-01-27T17:06:03Z"
  },
  "rec2": {
    "_id": "124",
    "name": "carl",
    "timestamp": "2012-01-27T17:07:03Z"
  }
}

.. which should allow me to use the Map/Reduce functions in the way I originally thought. However, I'm not sure if this is ideal.

I understand that my mind is still stuck in RDBMS land, so much of what I'm trying to do above may not be necessary. Any insight on this would be much appreciated.

Thanks!

Edit: Fixed JSON syntax in some of the examples.

Dominic Barnes · Accepted Answer

If you merely want a list of unique values, that's pretty easy. If you wish to identify the duplicates, then it gets less easy.

In both cases, a map function like this should suffice:

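function (doc) {
    // Skip documents without a name (design documents, for example).
    if (doc.name) {
        // The name is the key; no value is needed because _count only counts rows.
        emit(doc.name, null);
    }
}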

For your reduce function, just enter _count.

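Queried with group=true (without grouping, the reduce collapses everything into a single row with a null key), the view output for your two documents will look like this: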

{
    "rows": [
        { "key": "carl", "value": 2 }
    ]
}
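To make it clearer why the reduce step sees rows from every document rather than from just one, here is a rough plain-JavaScript sketch of what the view computes. It is only a simulation: mapDoc and countByKey are hypothetical names, the sort uses ordinary JavaScript string comparison rather than CouchDB's collation, and none of this is how CouchDB is implemented internally.

```javascript
// Hypothetical stand-in for the view's map function:
// returns the rows that one document would emit.
function mapDoc(doc) {
    return doc.name ? [{ id: doc._id, key: doc.name, value: null }] : [];
}

// Roughly mimics _count with group=true: one output row per distinct key.
function countByKey(docs) {
    var rows = [];
    // Map phase: every document contributes its rows independently.
    docs.forEach(function (doc) {
        rows = rows.concat(mapDoc(doc));
    });
    // The view index keeps rows ordered by key, so equal keys are adjacent.
    rows.sort(function (a, b) {
        return a.key < b.key ? -1 : a.key > b.key ? 1 : 0;
    });
    // Reduce phase: count adjacent rows that share a key.
    var out = [];
    rows.forEach(function (row) {
        var last = out[out.length - 1];
        if (last && last.key === row.key) {
            last.value += 1;
        } else {
            out.push({ key: row.key, value: 1 });
        }
    });
    return out;
}

var docs = [
    { _id: "123", name: "carl", timestamp: "2012-01-27T17:06:03Z" },
    { _id: "124", name: "carl", timestamp: "2012-01-27T17:07:03Z" }
];
console.log(JSON.stringify(countByKey(docs)));
// prints [{"key":"carl","value":2}]
```

The point of the sketch is that the map runs per document, but the reduce operates on the sorted rows of the whole index, which is what lets it count a name across documents.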

From there, you will have a list of names as well as their frequency. You can take that list and filter it yourself, or you can take the "all couch" route and use a _list function to perform that final filtering.

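function (head, req) {
    var row, duplicates = [];
    // Tell the client that the response body is JSON.
    start({ headers: { "Content-Type": "application/json" } });
    // getRow() returns the next view row, or null when there are none left.
    while ((row = getRow())) {
        // With _count and group=true, row.value is the number of documents sharing row.key.
        if (row.value > 1) {
            duplicates.push(row);
        }
    }
    send(JSON.stringify(duplicates));
}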

Read up on _list functions; they're pretty handy and versatile.
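If you also want the document IDs, as in the output the question asks for, one possible variation is to run a list function over the same view with reduce=false, so that each row still carries its document id. The sketch below is only an illustration under that assumption: collectDuplicateIds and dupesList are made-up names, and in a real design document the helper would have to be nested inside the single anonymous list function.

```javascript
// Walks rows that are already sorted by key (as a view returns them) and
// keeps only the keys that appear on more than one document.
// nextRow is any function returning the next row, or null when done.
function collectDuplicateIds(nextRow) {
    var groups = [], current = null, row;
    while ((row = nextRow())) {
        // A new key starts a new group; equal keys are adjacent in a view.
        if (current === null || current.name !== row.key) {
            current = { name: row.key, dupes: [] };
            groups.push(current);
        }
        current.dupes.push(row.id);
    }
    // Drop names that occur only once.
    return groups.filter(function (group) {
        return group.dupes.length > 1;
    });
}

// List function wrapper. getRow and send are provided by CouchDB when the
// list function runs; they are not defined in plain JavaScript.
function dupesList(head, req) {
    send(JSON.stringify(collectDuplicateIds(getRow)));
}
```

For the two documents in the question this would produce [{"name":"carl","dupes":["123","124"]}]. Keep in mind that it walks every row of the view on each request, so it may be slow on a large database.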
