I've been using MapReduce before to perform classical MR operation, the equivalent of GROUP BY in SQL.
I was wondering if it would be conceptually possible to perform a JOIN operation with MapReduce. Any idea how that could be implemented? Does it make sense to use MapReduce for this kind of operation?
Thanks!
MongoDB doesn't support relational operations likes joins. Instead, you can denormalise your data by embedding the rows you'd JOIN on inside the outer document. So instead of joining Products to Sales, you could have a products
collection with this schema:
products
{
_id: 123,
name: "Widget",
price: 9.99
sales:
[
{ id:1, date: "20100316", howMany: 2 },
{ id:2, date: "20100316", howMany: 5 }
]
}
Then whenever you retrieve a product, you also get its sales data so there's no need to join or lookup the info somewhere else.
Alternatively, you could split into two collections as you might with a relational database, then use an additional query to get a product's sales, something like this:
SQL: SELECT Sales WHERE ProductId = 123
MongoDB: db.sales.find( { productid: 123 } )
products
{
_id: 123,
name: "Widget",
price: 9.99
}
sales
{
id: 1,
productid: 123,
date: "20100316",
howMany: 2
}
{
id: 2,
productid: 123,
date: "20100316",
howMany: 5
}
My approach is below :
having a look to hadoop I have find CompositeInputFormat approach brefily, it takes two or more collections as an input for map-reduce job
according to my investigation mongodb dont provide this yet. mongodb mapReduce is performed on one colletion at a time.(please correct if I am worng)
so I have decided to put the collections that need to be joined in one collection on wich I will perform the mapreduce for "sql right join"
this is from my log reporter project. the first phase map-reduce is enough to perform right join in case "no clock". the second phase map-reduce has the aim to exclude superfluous right join caused by clock field.
db.test.drop();
db.test.insert({"username" : 1, "day" : 1, "clock" : 0 });
db.test.insert({"username" : 1, "day" : 1, "clock" : 1 });
db.test.insert({"username" : 1, startDay : 1,endDay:2, "table" : "user" });
//startDay : 1,endDay:2 are used to define the employers working day (join to company - left the company)
//you can use an array instedad of array here. for example day:[1,2,3, ...]
m1 = function(){
if( typeof this.table!= "undefined" && this.table!=null){
username = this.username;
startDay = this.startDay;
endDay = this.endDay;
while(startDay<=endDay){
emit({username:username,day:startDay},{clocks:["join"]});
// emit({username:username,day:startDay},1);
startDay++;
}
}else{
emit({username:this.username,day:this.day},{clocks:[this.clock]});
}
}
r1 = function(key,values){
result = {clocks:[]}
values.forEach(function(x){
result.clocks = x.clocks.concat(result.clocks);
result.clocks=result.clocks.filter(function(element, index, array){
return element!="join";
})
})
return result;
}
db.test.mapReduce(m1,r1,{out:"result1"})
db.test.find();
db.result1.find();
m2=function(){
key=this._id;
this.value.clocks.forEach(function(x){
key.clock=x;
emit(key,1);
})
}
r2 = function(key,values){
value=0;
values.forEach(function(x){
value+=1;
})
return result;
}
db.result1.mapReduce(m2,r2,{out:"result2"})
db.test.find();
db.result2.find();
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With