Chain of map operations in Spark performance

Question

My Spark job contains a chain of map operations:

JavaRDD<Row> rowRDD = raw
            .javaRDD()
            .mapPartitions(new CustomPartitionMapper())
            .map(new DataSpecialMapper(config))
            .map(new CsvFormatMapper(config))
            .map(new ReportCounters());

From a programming point of view, this makes the code more readable and testable. My question is about performance.

Will Spark interpret the chain of mappers as a single map operation and run it in the same executor? If not, what is the likely performance impact?

Thanks

Yuval Itzchakov · Accepted Answer

Will Spark interpret the chain of mappers as a single map operation and run it in the same executor?

Spark will pipeline multiple narrow transformations into a single stage. This means the chained map functions are run one after another within the same task, and therefore on the same executor. (See this blog post for more.)

However, each record still passes through all four functions, so the total work is roughly 4 * O(n) function calls. That is still O(n), but at a large enough input size the per-record overhead may affect performance, which is always something to keep in mind.
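The pipelining of narrow transformations can be illustrated without Spark. The sketch below is only an analogy: it uses a plain Java stream in place of an RDD partition, and the class name `PipelineSketch`, the `trace` list, and the two stand-in mappers are hypothetical names that do not come from the question's code. It records the order in which the chained functions are called, to show that each record flows through the whole chain before the next record is read, so the chain needs no intermediate collection between steps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PipelineSketch {
    // Records the order in which the stand-in mappers are invoked.
    static final List<String> trace = new ArrayList<>();

    // Stand-in for one partition flowing through two chained map steps.
    // This is an analogy using java.util.stream, not Spark's implementation.
    static List<String> runChain(List<Integer> partition) {
        trace.clear();
        return partition.stream()
                // hypothetical stand-in for DataSpecialMapper
                .map(x -> { trace.add("special(" + x + ")"); return x + 1; })
                // hypothetical stand-in for CsvFormatMapper
                .map(x -> { trace.add("csv(" + x + ")"); return "v=" + x; })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> out = runChain(Arrays.asList(1, 2));
        System.out.println(out);   // [v=2, v=3]
        System.out.println(trace); // [special(1), csv(2), special(2), csv(3)]
    }
}
```

The trace interleaves the two functions record by record instead of finishing the first map over the whole input before starting the second. The cost that remains is one function call per record per mapper, which is the per-record overhead mentioned above.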

David Greenshtein
