Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Cross product in MapReduce

I'd like to perform the expensive operation of cross product across two data sets in Hadoop using Java MapReduce.

For example, I have records from data set A and data set B, and I'd like each record in data set A to be matched up to each record in data set B in the output. I realize that the output size of this would be |A| * |B|, but want to do it anyways.

I see that Pig has CROSS but am unaware of how it is implemented at a high-level. Perhaps I will go take a look at the source code.

Not looking for any code, just want to know at a high-level how I should approach this problem.

like image 220
Donald Miner Avatar asked Apr 28 '12 17:04

Donald Miner


1 Answers

I have done something similar when looking at document similarity (comparing a document to every other document) and ended up with a custom input format that splits up the two datasets and then ensured there was a 'split' for each subset of data.

So your splits would look like (each merging two sets of 10 records, outputting 100 records)

A(1-10) x B(1-10)
A(11-20) x B(1-10)
A(21-30) x B(1-10)
A(1-10) x B(11-20)
A(11-20) x B(11-20)
A(21-30) x B(11-20)
A(1-10) x B(21-30)
A(11-20) x B(21-30)
A(21-30) x B(21-30)

I don't remember how performant it was though, but had a document set in the size order of thousands to compare against one another (on an 8 node dev cluster), with millions of cross products calculated.

I could also make improvements to the algorithm as some documents would never score well against others (if they had too much temporal time between them for example), and generate better splits as a result.

like image 66
Chris White Avatar answered Sep 20 '22 06:09

Chris White