Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Join operation with MongoDB MapReduce

I've been using MapReduce before to perform classical MR operation, the equivalent of GROUP BY in SQL.

I was wondering if it would be conceptually possible to perform a JOIN operation with MapReduce. Any idea how that could be implemented? Does it make sense to use MapReduce for this kind of operation?

Thanks!

like image 527
Johanisma Avatar asked May 18 '11 10:05

Johanisma


2 Answers

MongoDB doesn't support relational operations likes joins. Instead, you can denormalise your data by embedding the rows you'd JOIN on inside the outer document. So instead of joining Products to Sales, you could have a products collection with this schema:

products

{
    _id: 123,
    name: "Widget",
    price: 9.99
    sales:
    [ 
        { id:1, date: "20100316", howMany: 2 },
        { id:2, date: "20100316", howMany: 5 }
    ]
}

Then whenever you retrieve a product, you also get its sales data so there's no need to join or lookup the info somewhere else.

Alternatively, you could split into two collections as you might with a relational database, then use an additional query to get a product's sales, something like this:

SQL: SELECT Sales WHERE ProductId = 123

MongoDB: db.sales.find( { productid: 123 } )

products

{
    _id: 123,
    name: "Widget",
    price: 9.99
}

sales

{
    id: 1,
    productid: 123,
    date: "20100316",
    howMany: 2 
}

{
    id: 2,
    productid: 123,
    date: "20100316",
    howMany: 5
}
like image 103
Chris Fulstow Avatar answered Oct 11 '22 14:10

Chris Fulstow


My approach is below :

having a look to hadoop I have find CompositeInputFormat approach brefily, it takes two or more collections as an input for map-reduce job

according to my investigation mongodb dont provide this yet. mongodb mapReduce is performed on one colletion at a time.(please correct if I am worng)

so I have decided to put the collections that need to be joined in one collection on wich I will perform the mapreduce for "sql right join"

this is from my log reporter project. the first phase map-reduce is enough to perform right join in case "no clock". the second phase map-reduce has the aim to exclude superfluous right join caused by clock field.

db.test.drop();
db.test.insert({"username" : 1, "day" : 1, "clock" : 0 });
db.test.insert({"username" : 1, "day" : 1, "clock" : 1 });
db.test.insert({"username" : 1,  startDay : 1,endDay:2, "table" : "user" });

//startDay : 1,endDay:2 are used to define the employers working day (join to company - left the company)
//you can use an array instedad of array here. for example day:[1,2,3, ...]

m1 = function(){
   if( typeof this.table!= "undefined" && this.table!=null){
       username = this.username;
       startDay = this.startDay;
       endDay   = this.endDay;
       while(startDay<=endDay){
           emit({username:username,day:startDay},{clocks:["join"]});
          // emit({username:username,day:startDay},1);
           startDay++;
       }
   }else{
       emit({username:this.username,day:this.day},{clocks:[this.clock]});
   }
}
r1 = function(key,values){
    result = {clocks:[]}
    values.forEach(function(x){
        result.clocks = x.clocks.concat(result.clocks);
        result.clocks=result.clocks.filter(function(element, index, array){
            return element!="join";            
        })
    })
    return result;
}

db.test.mapReduce(m1,r1,{out:"result1"})
db.test.find();
db.result1.find();

m2=function(){
   key=this._id;
   this.value.clocks.forEach(function(x){
       key.clock=x;
       emit(key,1);       
   })   
}
r2 = function(key,values){
    value=0;
    values.forEach(function(x){
        value+=1;      
    })
    return result;
}

db.result1.mapReduce(m2,r2,{out:"result2"})
db.test.find();
db.result2.find();
like image 32
telman shahbazov Avatar answered Oct 11 '22 13:10

telman shahbazov