Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

custom join with non equal keys

I need to implement a custom join strategy, that would match for non strictly equal keys. To illustrate, one can think about distance : the join should occur when the keys are close enough (although in my case, it s a bit more complicated than just a distance metric)

So I can't implement this by overriding equals, since there's no equality (and I need to keep a true equality test for other needs). And I suppose i also need to implement a proper partitioner.

How could I do that ?

like image 275
mathieu Avatar asked May 08 '15 20:05

mathieu


1 Answers

Convert the RDDs to DataFrames, then you can do a join like this:

val newDF = leftDF.join(rightDF, $"col1" < ceilingVal and $"col1" > floorVal)

You can then define UDFs that you can use in your join. So if you had a "distanceUDF" like this:

val distanceUDF = udf[Int, Int, Int]((val1, val2) => val2 - val1)

You could then do:

val newDF = leftDF.join(rightDF, distanceUDF($"left.colX", $"right.colY") < 10)
like image 107
David Griffin Avatar answered Nov 12 '22 19:11

David Griffin