I am trying to filter inside a map function. In classic MapReduce I would do this by having the mapper write nothing to the context when the filter criteria are met. How can I achieve something similar with Spark? I can't return null from the map function, since it fails in the shuffle step. I could use the filter function, but that seems like an unnecessary extra pass over the data set when I could perform the same task during the map. I could also output null with a dummy key, but that's a bad workaround.
There are a few options:

rdd.flatMap: rdd.flatMap will flatten a Traversable collection returned for each element into the RDD. To pick elements, you typically return an Option as the result of the transformation:
rdd.flatMap(elem => if (filter(elem)) Some(f(elem)) else None)
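
A minimal, runnable sketch of this approach (the even-number predicate and the squaring are only placeholder logic standing in for filter and f, and the local[*] master is just for trying it out locally):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("flatMapFilter").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(1 to 10)

// Returning None drops the element, Some(...) keeps the transformed value,
// so the filtering and the mapping happen in a single pass over the data.
val evenSquares = rdd.flatMap(n => if (n % 2 == 0) Some(n * n) else None)

evenSquares.collect()   // Array(4, 16, 36, 64, 100)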
rdd.collect(pf: PartialFunction): allows you to provide a partial function that can filter and transform elements of the original RDD. You can use the full power of pattern matching with this method:
rdd.collect{case t if (cond(t)) => f(t)}
rdd.collect{case t:GivenType => f(t)}
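
For example, assuming the same sc as in the sketch above (the length condition, the upper-casing, and the mixed-type RDD are again only placeholders):

// Guard-based matching: elements the partial function does not match are dropped.
val words = sc.parallelize(Seq("spark", "rdd", "scala", "map"))
val longUpper = words.collect { case w if w.length > 3 => w.toUpperCase }
longUpper.collect()        // Array(SPARK, SCALA)

// Type-based matching on a heterogeneous RDD.
val mixed = sc.parallelize(Seq[Any](1, "two", 3.0, "four"))
val stringLengths = mixed.collect { case s: String => s.length }
stringLengths.collect()    // Array(3, 4)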
As Dean Wampler mentions in the comments, rdd.map(f(_)).filter(cond(_)) might be as good as, and even faster than, the other, more 'terse' options mentioned above, where f is a transformation (or map) function.
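
A sketch of that variant, reusing the rdd from the first example (same placeholder logic); since Spark pipelines narrow transformations such as map and filter within a single stage, chaining them does not force an extra pass over the data set:

// Transform first, then keep only the values we want.
val evenSquares2 = rdd.map(n => n * n).filter(_ % 2 == 0)

evenSquares2.collect()   // Array(4, 16, 36, 64, 100)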