I have quite a big dataset (100+ million records with hundreds of columns) that I am processing with Spark. I am reading the data into a Spark Dataset, and I want to filter this Dataset and map a subset of its fields to a case class.
The code looks something like this:
case class Subset(name: String, age: Int)
case class Complete(name: String, field1: String, field2, ..., age: Int)
val ds = spark.read.format("csv").load("data.csv").as[Complete]
// approach 1
ds.filter(x => x.age > 25).map(x => Subset(x.name, x.age))
// approach 2
ds.flatMap(x => if (x.age > 25) Seq(Subset(x.name, x.age)) else Seq.empty)
Which approach is better? Any additional hints on how I can make this code more performant?
Thanks!
Edit
I ran some tests to compare the runtimes, and it looks like approach 2 is quite a bit faster. The code I used to measure the runtimes is as follows:
val subset = spark.time {
  ds.filter(x => x.age > 25).map(x => Subset(x.name, x.age))
}
spark.time {
  subset.count()
}
and
val subset2 = spark.time {
  ds.flatMap(x => if (x.age > 25) Seq(Subset(x.name, x.age)) else Seq.empty)
}
spark.time {
  subset2.count()
}
Update: My original answer contained an error: Spark does support Seq as the result of a flatMap (and converts the result back into a Dataset). Apologies for the confusion. I have also added more information on improving the performance of your analysis.
Update 2: I missed that you're using a Dataset rather than an RDD (doh!). This doesn't affect the answer significantly.
Spark is a distributed system that partitions data across multiple nodes and processes it in parallel. In terms of efficiency, operations that result in re-partitioning (requiring data to be transferred between nodes) are far more expensive at run-time than in-place modifications. Also, note that operations that merely transform data, such as filter, map, flatMap, etc., are merely recorded and do not execute until an action is performed (such as reduce, fold, aggregate, etc.). Consequently, neither alternative actually does anything as things stand.
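As a minimal sketch of what that laziness means in practice (reusing your Subset case class): defining the pipeline below does no work at all, and the cluster only processes data once the count() action is invoked.
val pending = ds.filter(_.age > 25).map(x => Subset(x.name, x.age)) // lazily recorded, nothing runs yet
val matches = pending.count() // the action triggers the filter and map across the cluster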
When an action is performed on the result of these transformations, I would expect the filter operation to be far more efficient: it only processes data (using the subsequent map operation) that passes the predicate x => x.age > 25 (more typically written as _.age > 25). While it may appear that filter creates an intermediate collection, it executes lazily. As a result, Spark appears to fuse the filter and map operations together.
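If you want to see this for yourself, here is a sketch: write approach 1 with the shorthand predicate and inspect the physical plan with explain(); in recent Spark versions the filter and the map typically end up in a single whole-stage code-generated stage (the exact output depends on your Spark version).
val subset = ds.filter(_.age > 25).map(x => Subset(x.name, x.age))
subset.explain() // prints the physical plan; the filter and map should appear fused in one stage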
Your flatMap operation is, frankly, hideous. It forces processing, sequence creation and subsequent flattening of every data item, which will definitely increase overall processing.
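If you want to squeeze more out of this particular pipeline, one common alternative (a sketch, and it assumes the CSV column names line up with the Subset fields) is to express the predicate and projection as Column expressions rather than lambdas, so Catalyst can push the filter down and avoid deserializing every Complete object just to read two fields:
import spark.implicits._
val subsetDs = ds
  .filter($"age" > 25)      // Column-based predicate that Catalyst can optimise
  .select($"name", $"age")  // project only the columns you actually need
  .as[Subset]               // back to a typed Dataset[Subset]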
That said, the best way to improve the performance of your analysis is to control the partitioning so that the data is split roughly evenly over as many nodes as possible. Refer to this guide as a good starting point.
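As a rough sketch of what controlling the partitioning can look like (the partition counts below are arbitrary placeholders; tune them to your cluster size and data volume):
val balanced = ds.repartition(200)    // full shuffle: spread rows roughly evenly across 200 partitions
val compacted = balanced.coalesce(50) // reduce the partition count without a full shuffle, e.g. after heavy filtering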