I have the following code in Spark:
myData.filter(t => t.getMyEnum() == null)
.map(t => t.toString)
.saveAsTextFile("myOutput")
There are 2000+ files in the myOutput folder, but only a few records satisfy t.getMyEnum() == null, so there are very few output records. Since I don't want to hunt for a handful of records across 2000+ output files, I tried to combine the output using coalesce, like below:
myData.filter(t => t.getMyEnum() == null)
.map(t => t.toString)
.coalesce(1, false)
.saveAsTextFile("myOutput")
Then the job becomes EXTREMELY SLOW! I am wondering why it is so slow. There are only a few output records scattered across 2000+ partitions. Is there a better way to solve this problem?
When Spark writes data to storage systems like HDFS or S3, it can produce a large number of small files. This is mainly because Spark is a parallel processing system: writes are performed by many tasks, and each task writes its own file per partition. For example, if spark.sql.shuffle.partitions is left at its default of 200, a write that follows a shuffle produces up to 200 output files.
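To make that concrete, here is a minimal sketch (the SparkSession spark, the DataFrame df, and the column "key" are hypothetical names, not from the question) showing how the shuffle-partition setting controls the output file count for DataFrame writes:

// Sketch only: "spark", "df", and "key" are assumed, illustrative names.
// With the default spark.sql.shuffle.partitions = 200, a write after a
// shuffle produces up to 200 part files; lowering the setting produces fewer.
spark.conf.set("spark.sql.shuffle.partitions", "8")
df.groupBy("key").count()    // the shuffle now yields 8 partitions
.write.csv("myOutputCsv")    // so at most 8 part files are written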
Sometimes Spark runs slowly simply because too many concurrent tasks are running. High concurrency is normally a beneficial feature: it gives Spark fine-grained resource sharing, which maximizes utilization and cuts query latencies. But when each task processes only a handful of records, per-task scheduling overhead dominates the actual work.
repartition() works by creating new partitions and doing a full shuffle to move data around, which results in more or less equal-sized partitions. Since a full shuffle takes place, repartition is less performant than coalesce.
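As a small sketch (sc and nums are illustrative names), the difference shows up in how the partition count is reached:

val nums = sc.parallelize(1 to 1000, 200)  // start with 200 partitions
nums.repartition(10).getNumPartitions      // 10, produced by a full shuffle
nums.coalesce(10).getNumPartitions         // 10, by merging existing partitions with no shuffle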
One important point to note: Spark's repartition() and coalesce() can be very expensive operations, since they may shuffle data across many partitions, so try to minimize repartitioning as much as possible.
If you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can pass shuffle = true. This adds a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
Note: With shuffle = true, you can actually coalesce to a larger number of partitions. This is useful if you have a small number of partitions, say 100, potentially with a few partitions being abnormally large. Calling coalesce(1000, shuffle = true) will result in 1000 partitions with the data distributed using a hash partitioner.
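For example (again with illustrative names), coalescing to a larger partition count is a no-op unless shuffle = true:

val skewed = sc.parallelize(1 to 1000, 100)             // 100 partitions
skewed.coalesce(1000).getNumPartitions                  // still 100: cannot grow without a shuffle
skewed.coalesce(1000, shuffle = true).getNumPartitions  // 1000, redistributed by a hash partitioner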
So try passing shuffle = true to the coalesce function, i.e.:
myData.filter(_.getMyEnum == null)
.map(_.toString)
.coalesce(1, shuffle = true)
.saveAsTextFile("myOutput")
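For what it's worth, RDD.repartition(n) is implemented as coalesce(n, shuffle = true), so an equivalent way to write this is:

myData.filter(_.getMyEnum == null)
.map(_.toString)
.repartition(1)
.saveAsTextFile("myOutput")

Either way, the filter and map still run in parallel across the original 2000+ partitions; only the final write is funneled into a single task.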