
spark - filter within map

I am trying to filter inside a map function. In classic MapReduce I would do this by having the mapper simply write nothing to the context when the filter criterion is met. How can I achieve something similar with Spark? I can't return null from the map function, as that fails in the shuffle step. I could use the filter function, but that seems like an unnecessary iteration over the data set when I could perform the same task during the map. I could also output null with a dummy key, but that's a bad workaround.

asked Mar 03 '15 by nir

1 Answer

There are a few options:

rdd.flatMap: rdd.flatMap maps each element to a collection and flattens the results into the resulting RDD. To filter and transform in one step, return an Option from the transformation: Some keeps the element, None drops it.

rdd.flatMap(elem => if (filter(elem)) Some(f(elem)) else None)
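
A minimal runnable sketch of this, assuming a SparkContext named sc, with isValid and transform standing in for the question's filter condition and map function:

val data = sc.parallelize(Seq(1, 2, 3, 4, 5))
def isValid(x: Int): Boolean = x % 2 == 0   // the filter criterion (hypothetical)
def transform(x: Int): Int = x * 10         // the "map" part (hypothetical)
// Some(...) keeps an element, None drops it, all in a single pass
val kept = data.flatMap(x => if (isValid(x)) Some(transform(x)) else None)
// kept.collect() => Array(20, 40)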

rdd.collect(pf: PartialFunction) lets you provide a partial function that filters and transforms elements of the original RDD in one step, with the full power of pattern matching.

rdd.collect{case t if (cond(t)) => f(t)}
rdd.collect{case t:GivenType => f(t)}
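
Using the same hypothetical data, isValid and transform from the sketch above:

// This overload of collect, which takes a PartialFunction, is a transformation
// returning an RDD; the zero-argument collect() is the action returning an Array.
val kept2 = data.collect { case x if isValid(x) => transform(x) }
// kept2.collect() => Array(20, 40)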

As Dean Wampler mentions in the comments, rdd.map(f(_)).filter(cond(_)) can be just as good, and even faster, than the more 'terse' options above: Spark transformations are lazy and are pipelined within a stage, so the extra filter does not mean a second iteration over the data.

Here f is the transformation (map) function and cond is the filter predicate.
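
For comparison, a sketch of the chained form using the same hypothetical data and helpers as above; filter and map are fused into one pipelined stage at runtime:

val kept3 = data.filter(isValid(_)).map(transform(_))
// kept3.collect() => Array(20, 40)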

answered Sep 17 '22 by maasg