I'd like to perform the expensive operation of a cross product of two data sets in Hadoop using Java MapReduce.
For example, I have records from data set A and data set B, and I'd like each record in data set A to be matched up with each record in data set B in the output. I realize that the output size would be |A| * |B|, but I want to do it anyway.
I see that Pig has CROSS, but I am unaware of how it is implemented at a high level. Perhaps I will go take a look at the source code.
I'm not looking for any code; I just want to know, at a high level, how I should approach this problem.
I have done something similar when looking at document similarity (comparing each document to every other document) and ended up with a custom input format that splits up the two datasets and then ensures there is a 'split' for each pairing of the subsets.
So your splits would look like this (each pairing two subsets of 10 records and outputting 100 record pairs); a sketch of such an input format follows the list:
A(1-10) x B(1-10)
A(11-20) x B(1-10)
A(21-30) x B(1-10)
A(1-10) x B(11-20)
A(11-20) x B(11-20)
A(21-30) x B(11-20)
A(1-10) x B(21-30)
A(11-20) x B(21-30)
A(21-30) x B(21-30)
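To make the shape of this concrete, here is a minimal sketch of such an input format, assuming the Hadoop 2.x `mapreduce` API and two line-oriented text inputs. The names are all mine for illustration, not anything Pig or Hadoop ships: `CrossInputFormat`, `CrossSplit`, and the config keys `cross.input.a` / `cross.input.b` are hypothetical.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

/** Crosses the splits of two text inputs so each map task gets one (A-split, B-split) pair. */
public class CrossInputFormat extends InputFormat<Text, Text> {

  /** A compound split holding one split from each side. */
  public static class CrossSplit extends InputSplit implements Writable {
    FileSplit a = new FileSplit();
    FileSplit b = new FileSplit();
    public CrossSplit() {}
    CrossSplit(FileSplit a, FileSplit b) { this.a = a; this.b = b; }
    @Override public long getLength() throws IOException, InterruptedException {
      return a.getLength() + b.getLength();
    }
    @Override public String[] getLocations() throws IOException, InterruptedException {
      return a.getLocations(); // schedule near the A side's blocks
    }
    @Override public void write(DataOutput out) throws IOException { a.write(out); b.write(out); }
    @Override public void readFields(DataInput in) throws IOException { a.readFields(in); b.readFields(in); }
  }

  @Override
  public List<InputSplit> getSplits(JobContext ctx) throws IOException, InterruptedException {
    List<InputSplit> aSplits = splitsFor(ctx, ctx.getConfiguration().get("cross.input.a"));
    List<InputSplit> bSplits = splitsFor(ctx, ctx.getConfiguration().get("cross.input.b"));
    List<InputSplit> crossed = new ArrayList<InputSplit>();
    for (InputSplit a : aSplits)      // |aSplits| * |bSplits| map tasks in total
      for (InputSplit b : bSplits)
        crossed.add(new CrossSplit((FileSplit) a, (FileSplit) b));
    return crossed;
  }

  /** Reuses TextInputFormat's splitting logic against a single directory. */
  private List<InputSplit> splitsFor(JobContext ctx, String dir)
      throws IOException, InterruptedException {
    Job scoped = Job.getInstance(ctx.getConfiguration());
    TextInputFormat.setInputPaths(scoped, new Path(dir));
    return new TextInputFormat().getSplits(scoped);
  }

  @Override
  public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
    return new RecordReader<Text, Text>() {
      private final LineRecordReader aReader = new LineRecordReader();
      private final List<Text> bLines = new ArrayList<Text>();
      private int bIdx;
      private Text key, value;

      @Override public void initialize(InputSplit s, TaskAttemptContext c)
          throws IOException, InterruptedException {
        CrossSplit cs = (CrossSplit) s;
        // Buffer the whole B-side split in memory; assumes B's splits are
        // small enough to fit (tune the B input's split size if not).
        LineRecordReader bReader = new LineRecordReader();
        bReader.initialize(cs.b, c);
        while (bReader.nextKeyValue()) bLines.add(new Text(bReader.getCurrentValue()));
        bReader.close();
        aReader.initialize(cs.a, c);
        bIdx = bLines.size(); // force an advance of the A side on the first call
      }
      @Override public boolean nextKeyValue() throws IOException, InterruptedException {
        if (bLines.isEmpty()) return false;
        if (bIdx == bLines.size()) {          // finished B for the current A record
          if (!aReader.nextKeyValue()) return false;
          key = new Text(aReader.getCurrentValue());
          bIdx = 0;
        }
        value = bLines.get(bIdx++);
        return true;
      }
      @Override public Text getCurrentKey() { return key; }
      @Override public Text getCurrentValue() { return value; }
      @Override public float getProgress() throws IOException { return aReader.getProgress(); }
      @Override public void close() throws IOException { aReader.close(); }
    };
  }
}
```

In the driver you would set the two config keys and `job.setInputFormatClass(CrossInputFormat.class)`; each `map()` call then receives one (A line, B line) pair. Buffering the B side only works because the splits are deliberately small; for bigger subsets you would re-open and stream the B split once per A record instead.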
I don't remember exactly how performant it was, but the document set was on the order of thousands of documents compared against one another (on an 8-node dev cluster), with millions of cross products calculated.
I could also improve the algorithm: some documents would never score well against others (for example, if there was too large a time gap between them), so I could prune those pairs and generate better splits as a result; a sketch of that idea follows.
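If the inputs carry something you can filter on before reading them, that pruning can happen right in getSplits(), so the skipped pairs never even become map tasks. A sketch bolted onto the CrossInputFormat above; the file-naming convention, `couldMatch`, and the threshold of 30 are all invented for illustration:

```java
// Hypothetical pruning predicate: assumes file names embed a batch
// sequence number like "docs-00042.txt", and that pairs more than 30
// batches apart never score well enough to be worth comparing.
private static boolean couldMatch(FileSplit a, FileSplit b) {
  long batchA = Long.parseLong(a.getPath().getName().replaceAll("\\D", ""));
  long batchB = Long.parseLong(b.getPath().getName().replaceAll("\\D", ""));
  return Math.abs(batchA - batchB) <= 30; // made-up threshold
}

// ...and in getSplits(), cross only the pairs that pass:
for (InputSplit a : aSplits)
  for (InputSplit b : bSplits)
    if (couldMatch((FileSplit) a, (FileSplit) b))
      crossed.add(new CrossSplit((FileSplit) a, (FileSplit) b));
```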