My Spark job contains a chain of map operations
JavaRDD<Row> rowRDD = raw
.javaRDD()
.mapPartitions(new CustomPartitionMapper())
.map(new DataSpecialMapper(config))
.map(new CsvFormatMapper(config))
.map(new ReportCounters());
From the programming point of view this keeps the code more readable and testable. The question is about performance.
Will Spark interpret the chain of mappers as a single map operation and execute it in the same executor? If not, what could the performance impact be?
Thanks
Will Spark interpret the chain of mappers as a single map operation and execute it in the same executor?
Spark will optimize multiple narrow transformations into a single stage, which means the chained map functions will run one after another within the same task. (See this blog post for more.)
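If you want to confirm this yourself, a minimal sketch (reusing the rowRDD variable from your question) is to print the RDD lineage; narrow transformations that Spark pipelines into one stage appear at the same indentation level, and the Spark UI will show a single stage for the whole map chain:

// Prints the lineage of the RDD; pipelined narrow transformations
// share one indentation level, i.e. one stage.
System.out.println(rowRDD.toDebugString());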
But you are still applying four separate functions (and four iterator wrappers) over each partition, i.e. 4 * O(n) work. That is still O(n) overall, but the constant factor can affect performance at a given input size, which is always something to keep in mind.
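If that per-element overhead ever becomes measurable, you can keep the mappers as separate, individually testable classes but apply them through a single map() call so the partition iterator is wrapped once instead of three times. This is only a sketch and assumes DataSpecialMapper, CsvFormatMapper and ReportCounters all implement Function<Row, Row>; adjust the types to whatever your mappers actually use.

import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Row;

// Applies a list of Row -> Row mappers in sequence inside one map() call.
public class CompositeMapper implements Function<Row, Row> {
    private final List<Function<Row, Row>> mappers;

    public CompositeMapper(List<Function<Row, Row>> mappers) {
        this.mappers = mappers;
    }

    @Override
    public Row call(Row row) throws Exception {
        Row result = row;
        for (Function<Row, Row> mapper : mappers) {
            result = mapper.call(result);
        }
        return result;
    }
}

// Usage: the chain from the question then becomes
JavaRDD<Row> rowRDD = raw
        .javaRDD()
        .mapPartitions(new CustomPartitionMapper())
        .map(new CompositeMapper(Arrays.asList(
                new DataSpecialMapper(config),
                new CsvFormatMapper(config),
                new ReportCounters())));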