Scala Spark: Split collection into several RDD?

Tags:

apache-spark

Is there a Spark function that splits a collection into several RDDs according to some criteria? Such a function would avoid iterating over the data more than once. For example:

import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
    val logFile = "file.txt"
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    // Each filter + save below is a separate pass over logData
    val lineAs = logData.filter(line => line.contains("a")).saveAsTextFile("linesA.txt")
    val lineBs = logData.filter(line => line.contains("b")).saveAsTextFile("linesB.txt")
}

In this example I have to iterate over `logData` twice just to write the results to two separate files:

    val lineAs = logData.filter(line => line.contains("a")).saveAsTextFile("linesA.txt")
    val lineBs = logData.filter(line => line.contains("b")).saveAsTextFile("linesB.txt")

It would be nice instead to have something like this:

    val resultMap = logData.map(line => if (line.contains("a")) ("a", line) else if (line.contains("b")) ("b", line) else (" - ", line))
    resultMap.writeByKey("a", "linesA.txt") 
    resultMap.writeByKey("b", "linesB.txt")

Any such thing?


asked Dec 01 '14 15:12

1 Answer

Maybe something like this would work:

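import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// The ClassTag bound is required because flatMap below builds an RDD[T].
def singlePassMultiFilter[T: ClassTag](
      rdd: RDD[T],
      f1: T => Boolean,
      f2: T => Boolean,
      level: StorageLevel = StorageLevel.MEMORY_ONLY
  ): (RDD[T], RDD[T], Boolean => Unit) = {
  // One pass per partition: collect the matches of each filter into its own buffer.
  val tempRDD = rdd mapPartitions { iter =>
    val abuf1 = ArrayBuffer.empty[T]
    val abuf2 = ArrayBuffer.empty[T]
    for (x <- iter) {
      if (f1(x)) abuf1 += x
      if (f2(x)) abuf2 += x
    }
    Iterator.single((abuf1, abuf2))
  }
  tempRDD.persist(level)
  // Each result RDD just unpacks one side of the persisted pairs.
  val rdd1 = tempRDD.flatMap(_._1)
  val rdd2 = tempRDD.flatMap(_._2)
  // The third element lets the caller release tempRDD when done.
  (rdd1, rdd2, (blocking: Boolean) => tempRDD.unpersist(blocking))
}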

Note that an action called on rdd1 (resp. rdd2) will cause tempRDD to be computed and persisted. This is practically equivalent to computing rdd2 (resp. rdd1) as well, since the overhead of the flatMap in the definitions of rdd1 and rdd2 is, I believe, going to be pretty negligible.
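To see what the loop does inside a single partition, it can be tried on a plain Scala iterator without Spark. This is only an illustrative sketch: `SinglePassDemo` and `bucketOnce` are hypothetical names introduced here, not part of the function above or of Spark.

```scala
import scala.collection.mutable.ArrayBuffer

object SinglePassDemo {
  // Hypothetical helper mirroring the body of mapPartitions above:
  // a single pass over `iter` that fills one buffer per predicate.
  def bucketOnce[T](
      iter: Iterator[T],
      f1: T => Boolean,
      f2: T => Boolean
  ): (ArrayBuffer[T], ArrayBuffer[T]) = {
    val abuf1 = ArrayBuffer.empty[T]
    val abuf2 = ArrayBuffer.empty[T]
    for (x <- iter) {
      if (f1(x)) abuf1 += x
      if (f2(x)) abuf2 += x
    }
    (abuf1, abuf2)
  }

  def main(args: Array[String]): Unit = {
    val lines = Iterator("alpha", "beta", "gamma", "cd")
    val (as, bs) = bucketOnce[String](lines, _.contains("a"), _.contains("b"))
    println(as.mkString(",")) // prints "alpha,beta,gamma"
    println(bs.mkString(",")) // prints "beta"
  }
}
```

Two things are visible here: an element that satisfies both predicates ("beta") ends up in both buffers, and every match from the partition is held in memory until the pass finishes, so very large partitions could be a concern.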

You would use singlePassMultiFilter like so:

val (rdd1, rdd2, cleanUp) = singlePassMultiFilter(rdd, f1, f2)
rdd1.persist()    //I'm going to need `rdd1` more later...
println(rdd1.count)  
println(rdd2.count) 
cleanUp(true)     //I'm done with `rdd2` and `rdd1` has been persisted so free stuff up...
println(rdd1.distinct.count)

Clearly this could be extended to an arbitrary number of filters, collections of filters, etc.
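One possible shape for that extension is sketched below, assuming the filters arrive as a `Seq` of predicates. The names `MultiFilter` and `singlePassMultiFilterN` are hypothetical and chosen only for illustration; I have not benchmarked this variant.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object MultiFilter {
  // Sketch only: same idea as singlePassMultiFilter, with one buffer per predicate.
  def singlePassMultiFilterN[T: ClassTag](
      rdd: RDD[T],
      filters: Seq[T => Boolean],
      level: StorageLevel = StorageLevel.MEMORY_ONLY
  ): (Seq[RDD[T]], Boolean => Unit) = {
    val tempRDD = rdd.mapPartitions { iter =>
      // One buffer per filter, filled in a single pass over the partition.
      val bufs = Vector.fill(filters.size)(ArrayBuffer.empty[T])
      for (x <- iter; i <- filters.indices if filters(i)(x)) bufs(i) += x
      Iterator.single(bufs)
    }
    tempRDD.persist(level)
    // The i-th result RDD unpacks the i-th buffer of every partition.
    val rdds = filters.indices.map(i => tempRDD.flatMap(bufs => bufs(i)))
    (rdds, (blocking: Boolean) => { tempRDD.unpersist(blocking); () })
  }
}
```

The returned sequence is in the same order as `filters`, so `rdds(0)` corresponds to the first predicate. As with the two-filter version, the predicates are shipped to the executors and therefore need to be serializable.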

answered Oct 02 '22 20:10

Jason Lenderman

