Chain of map operations in Spark performance

My Spark job contains a chain of map operations:

JavaRDD<Row> rowRDD = raw
            .javaRDD()
            .mapPartitions(new CustomPartitionMapper())
            .map(new DataSpecialMapper(config))
            .map(new CsvFormatMapper(config))
            .map(new ReportCounters());

From a programming point of view, this makes the code more readable and testable. My question is about performance.

Will Spark treat the chain of mappers as a single map operation and run it in the same executor, or not? If not, what might the performance impact be?

Thanks

David Greenshtein asked Jan 12 '17 16:01


1 Answer

Will Spark treat the chain of mappers as a single map operation and run it in the same executor, or not?

Spark pipelines consecutive narrow transformations into a single stage. This means the map functions run one after another inside the same task, and therefore on the same executor, for each partition. (See this blog post for more)
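To make the idea concrete, here is a minimal plain-Java sketch of that kind of pipelining. It is not Spark's actual implementation and does not use the Spark API; `PipelineSketch`, `lazyMap` and `trace` are hypothetical names made up for illustration. It models a partition as an iterator and each map step as a lazy wrapper around the previous one, which is roughly how chained narrow transformations behave inside one task.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration only: a toy model of pipelined map steps,
// not code taken from Spark.
public class PipelineSketch {

    // Wraps an iterator so that fn is applied lazily, one element per next() call.
    static <A, B> Iterator<B> lazyMap(Iterator<A> source, Function<A, B> fn) {
        return new Iterator<B>() {
            @Override public boolean hasNext() { return source.hasNext(); }
            @Override public B next() { return fn.apply(source.next()); }
        };
    }

    // Runs a two-step chain over one "partition" and records the order of calls.
    static List<String> trace(List<Integer> partition) {
        List<String> log = new ArrayList<>();
        Iterator<Integer> step1 = lazyMap(partition.iterator(), x -> {
            log.add("double(" + x + ")");
            return x * 2;
        });
        Iterator<String> step2 = lazyMap(step1, x -> {
            log.add("format(" + x + ")");
            return "v=" + x;
        });
        // Pulling from the last step drives the whole chain, element by element.
        while (step2.hasNext()) {
            step2.next();
        }
        return log;
    }

    public static void main(String[] args) {
        // Prints [double(1), format(2), double(2), format(4), double(3), format(6)]
        System.out.println(trace(Arrays.asList(1, 2, 3)));
    }
}
```

The interleaved log shows each record going through both steps before the next record is read, so no intermediate collection is built between the steps.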

But you are still applying four functions to every record of each partition, so the work is roughly 4 * O(n). That is still O(n), but it may affect performance at a large enough input size, which is always something to keep in mind.
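As a rough illustration of that cost, the sketch below counts function calls for a four-step chain. It uses `java.util.stream` as a stand-in for one partition, not the Spark API, and `MapCostSketch`, `countCalls` and the four lambdas are hypothetical placeholders for the mappers in the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration only: counts how many function calls a
// four-step map chain makes over n records.
public class MapCostSketch {

    // Runs four chained map steps and returns the total number of calls made.
    static long countCalls(List<Integer> records) {
        AtomicLong calls = new AtomicLong();
        // Placeholder steps standing in for the four mappers in the question.
        Function<Integer, Integer> parse = x -> { calls.incrementAndGet(); return x + 1; };
        Function<Integer, Integer> special = x -> { calls.incrementAndGet(); return x * 2; };
        Function<Integer, String> csv = x -> { calls.incrementAndGet(); return x + ","; };
        Function<String, String> report = s -> { calls.incrementAndGet(); return s; };

        // collect() forces the lazy chain to run over every record.
        records.stream()
                .map(parse)
                .map(special)
                .map(csv)
                .map(report)
                .collect(Collectors.toList());
        return calls.get();
    }

    public static void main(String[] args) {
        // 3 records * 4 steps, prints 12
        System.out.println(countCalls(Arrays.asList(1, 2, 3)));
    }
}
```

The count grows as 4 * n, so the chain stays linear. Whether the constant factor matters depends on how expensive each mapper is, which only profiling the real job can tell you.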

Yuval Itzchakov answered Sep 21 '22 10:09
