Apache Spark RDD filter into two RDDs

I need to split an RDD into two parts: one that satisfies a condition and one that does not. I can call filter twice on the original RDD, but that seems inefficient. Is there a better way to do this? I can't find anything in the API or in the literature.

asked Apr 09 '15 19:04

monster

2 Answers

Spark doesn't support this by default. Filtering on the same data twice isn't that bad if you cache it beforehand, and the filtering itself is quick.

If it's really just two different types, you can use a helper method:

import org.apache.spark.rdd.RDD

implicit class RDDOps[T](rdd: RDD[T]) {
  def partitionBy(f: T => Boolean): (RDD[T], RDD[T]) = {
    val passes = rdd.filter(f)
    val fails = rdd.filter(e => !f(e)) // Spark doesn't have filterNot
    (passes, fails)
  }
}

val (matches, matchesNot) = sc.parallelize(1 to 100).cache().partitionBy(_ % 2 == 0)

But as soon as you have more than two types of data, just assign each filtered RDD to its own val.
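For the multi-category case, a minimal sketch could look like the following. The object name ThreeWaySplit, the helper bands, and the band boundaries are hypothetical and chosen only for illustration; the local[2] master is an assumption so the example runs without a cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ThreeWaySplit {
  // Hypothetical helper (not a Spark API): one filter per band over a cached RDD.
  def bands(numbers: RDD[Int]): (RDD[Int], RDD[Int], RDD[Int]) = {
    // cache() is lazy: the data is stored on the first action and reused afterwards.
    val cached = numbers.cache()
    val small  = cached.filter(_ < 5)
    val medium = cached.filter(n => n >= 5 && n < 15)
    val large  = cached.filter(_ >= 15)
    (small, medium, large)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master with two threads, for illustration only.
    val conf = new SparkConf().setAppName("three-way-split").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (small, medium, large) = bands(sc.parallelize(1 to 20))
      println(small.collect().mkString(", "))
      println(medium.collect().mkString(", "))
      println(large.collect().mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

Each band is still a separate pass over the data, but with the cache in place those passes read stored partitions instead of recomputing the upstream lineage.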

answered Sep 30 '22 13:09

Marius Soutier

Spark's RDD does not have such an API.

Here is a version based on a pull request for rdd.span that should work:

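import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def split[T: ClassTag](rdd: RDD[T], p: T => Boolean): (RDD[T], RDD[T]) = {

    // Each partition yields exactly two iterators: matches first, non-matches second.
    val splits = rdd.mapPartitions { iter =>
        val (left, right) = iter.partition(p)
        val iterSeq = Seq(left, right)
        iterSeq.iterator
    }

    // Take the first iterator of each partition (elements that satisfy p).
    val left = splits.mapPartitions { iter => iter.next() }

    // Skip the first iterator and take the second (elements that do not satisfy p).
    val right = splits.mapPartitions { iter =>
        iter.next()
        iter.next()
    }
    (left, right)
}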

val rdd = sc.parallelize(0 to 10, 2)

val (first, second) = split[Int](rdd, _ % 2 == 0 )

first.collect
// Array[Int] = Array(0, 2, 4, 6, 8, 10)
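
The per-partition step above relies on Iterator.partition from the Scala standard library. As a rough illustration of how it behaves, here is a sketch in plain Scala with no Spark involved; the object name IteratorPartitionDemo, the method run, and the pulled counter are hypothetical and exist only to make the behaviour observable.

```scala
object IteratorPartitionDemo {
  // Returns (odds, pulls after draining odds, evens, pulls after draining evens).
  def run(): (List[Int], Int, List[Int], Int) = {
    var pulled = 0
    // Source iterator that counts how many elements have been pulled from it.
    val source = (0 to 10).iterator.map { n => pulled += 1; n }

    // Both result iterators are backed by the same single source.
    val (evens, odds) = source.partition(_ % 2 == 0)

    // Draining odds first walks the whole source and buffers the evens it skips.
    val oddList = odds.toList
    val pulledAfterOdds = pulled

    // The evens are then served from that buffer; the source is not pulled again.
    val evenList = evens.toList
    val pulledAfterEvens = pulled

    (oddList, pulledAfterOdds, evenList, pulledAfterEvens)
  }

  def main(args: Array[String]): Unit = {
    val (oddList, afterOdds, evenList, afterEvens) = run()
    println(oddList)    // List(1, 3, 5, 7, 9)
    println(afterOdds)  // 11
    println(evenList)   // List(0, 2, 4, 6, 8, 10)
    println(afterEvens) // 11
  }
}
```

This suggests two caveats for split. Reading only the second iterator buffers the skipped elements of that partition in memory. And since left and right are separate RDDs derived from splits, evaluating both still traverses the source partitions once per result unless the input RDD is cached.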

answered Sep 30 '22 14:09

Shyamendra Solanki
