
spark - filter within map

I am trying to filter inside a map function. In classic MapReduce I would do this by having the mapper simply write nothing to the context when the filter criterion is met. How can I achieve something similar with Spark? I can't return null from the map function, as that fails in the shuffle step. I could use the filter function, but that seems like an unnecessary iteration over the data set when I could perform the same task during the map. I could also output null with a dummy key, but that's a bad workaround.

asked Mar 03 '15 by nir

1 Answer

There are a few options:

rdd.flatMap: rdd.flatMap maps each element to a collection and flattens the results into the resulting RDD. To filter and transform in one step, return an Option from the transformation: Some keeps the element, None drops it.

rdd.flatMap(elem => if (filter(elem)) Some(f(elem)) else None)
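
A minimal runnable sketch of this, assuming a SparkContext named sc, with isValid and transform standing in for the question's filter condition and map function:

val data = sc.parallelize(Seq(1, 2, 3, 4, 5))
def isValid(x: Int): Boolean = x % 2 == 0   // the filter criterion (hypothetical)
def transform(x: Int): Int = x * 10         // the "map" part (hypothetical)
// Some(...) keeps an element, None drops it, all in a single pass
val kept = data.flatMap(x => if (isValid(x)) Some(transform(x)) else None)
// kept.collect() => Array(20, 40)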

rdd.collect(pf: PartialFunction) lets you provide a partial function that filters and transforms elements of the original RDD in one step, with the full power of pattern matching.

rdd.collect{case t if (cond(t)) => f(t)}
rdd.collect{case t:GivenType => f(t)}
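
Using the same hypothetical data, isValid and transform from the sketch above:

// This overload of collect, which takes a PartialFunction, is a transformation
// returning an RDD; the zero-argument collect() is the action returning an Array.
val kept2 = data.collect { case x if isValid(x) => transform(x) }
// kept2.collect() => Array(20, 40)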

As Dean Wampler mentions in the comments, rdd.map(f(_)).filter(cond(_)) can be just as good, and even faster, than the more 'terse' options above: Spark transformations are lazy and are pipelined within a stage, so the extra filter does not mean a second iteration over the data.

Here f is the transformation (map) function and cond is the filter predicate.
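
For comparison, a sketch of the chained form using the same hypothetical data and helpers as above; filter and map are fused into one pipelined stage at runtime:

val kept3 = data.filter(isValid(_)).map(transform(_))
// kept3.collect() => Array(20, 40)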

answered Sep 17 '22 by maasg