I need to implement a custom join strategy, that would match for non strictly equal keys. To illustrate, one can think about distance : the join should occur when the keys are close enough (although in my case, it s a bit more complicated than just a distance metric)
So I can't implement this by overriding equals, since there's no equality (and I need to keep a true equality test for other needs). And I suppose i also need to implement a proper partitioner.
How could I do that ?
Convert the RDDs to DataFrames, then you can do a join like this:
val newDF = leftDF.join(rightDF, $"col1" < ceilingVal and $"col1" > floorVal)
You can then define UDFs that you can use in your join. So if you had a "distanceUDF" like this:
val distanceUDF = udf[Int, Int, Int]((val1, val2) => val2 - val1)
You could then do:
val newDF = leftDF.join(rightDF, distanceUDF($"left.colX", $"right.colY") < 10)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With