 

When to use mapPartitions and mapPartitionsWithIndex?

The PySpark documentation describes two functions:

mapPartitions(f, preservesPartitioning=False)

   Return a new RDD by applying a function to each partition of this RDD.

   >>> rdd = sc.parallelize([1, 2, 3, 4], 2)
   >>> def f(iterator): yield sum(iterator)
   >>> rdd.mapPartitions(f).collect()
   [3, 7]

And ...

mapPartitionsWithIndex(f, preservesPartitioning=False)

   Return a new RDD by applying a function to each partition of this RDD, 
   while tracking the index of the original partition.

   >>> rdd = sc.parallelize([1, 2, 3, 4], 4)
   >>> def f(splitIndex, iterator): yield splitIndex
   >>> rdd.mapPartitionsWithIndex(f).sum()
   6

What use cases do these functions attempt to solve? I can't see why they would be required.

Asked Nov 11 '15 by Chris Snow


People also ask

What is mapPartitionsWithIndex?

mapPartitionsWithIndex(f) is similar to map, but it runs f once per partition rather than per element and also provides the index of the partition being processed.

What is the difference between map and mapPartitions in Spark?

mapPartitions() – This produces the same result as map(); the difference is that Spark's mapPartitions() lets you do heavy initialization (for example, opening a database connection) once per partition instead of once for every row.
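A minimal PySpark sketch of that pattern (the FakeConnection class below is a hypothetical stand-in for whatever expensive resource you actually open):

    from pyspark import SparkContext

    class FakeConnection:
        """Stand-in for an expensive resource such as a database connection (hypothetical)."""
        def lookup(self, record):
            return record * 2            # trivial placeholder for real per-row work
        def close(self):
            pass

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize(range(12), 4)

    def enrich_partition(iterator):
        conn = FakeConnection()          # opened once per partition, not once per element
        for record in iterator:
            yield conn.lookup(record)    # every element in the partition reuses the same connection
        conn.close()

    print(rdd.mapPartitions(enrich_partition).collect())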

What is MapPartitionsRDD?

MapPartitionsRDD is an RDD that applies the provided function f to every partition of the parent RDD. By default it does not preserve partitioning: the last input parameter, preservesPartitioning, is false. If it is true, it retains the original RDD's partitioning.
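A hedged sketch of what preservesPartitioning buys you: if your function leaves the keys untouched after a partitionBy, declaring that lets a later reduceByKey reuse the existing partitioner instead of shuffling again (local SparkContext assumed):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    pairs = sc.parallelize([(1, "a"), (2, "b"), (1, "c"), (2, "d")]).partitionBy(2)

    def add_suffix(iterator):
        for key, value in iterator:
            yield (key, value + "!")     # keys are unchanged, so the hash layout is still valid

    same_layout = pairs.mapPartitions(add_suffix, preservesPartitioning=True)
    print(same_layout.reduceByKey(lambda a, b: a + b).collect())   # can reuse the existing partitioner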


1 Answer

To answer this question we need to compare map with mapPartitions/mapPartitionsWithIndex (mapPartitions and mapPartitionsWithIndex do pretty much the same thing, except that mapPartitionsWithIndex also lets you track which partition is being processed).

Now, mapPartitions and mapPartitionsWithIndex are there to optimize the performance of your application. Just for the sake of understanding, let's say every element in your RDD is an XML document and you need a parser instance to process each of them. You could obtain that instance in two ways:

map/foreach: In this case an instance of the parser class is created for each element, the element is processed, and the instance is eventually destroyed without ever being reused for other elements. So if you are working with an RDD of 12 elements distributed among 4 partitions, the parser instance is created 12 times. Since constructing the parser is an expensive operation, this adds up, as the sketch below illustrates.
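To make that cost visible, here is a small sketch with a trivial stand-in Parser class (hypothetical; the print is only there so you can count constructions when running in local mode):

    from pyspark import SparkContext

    class Parser:
        """Stand-in for an expensive-to-construct XML parser (hypothetical)."""
        def __init__(self):
            print("Parser created")              # visible in local mode; fires once per construction
        def parse(self, element):
            return element.upper()               # placeholder for real parsing work

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize(["<a/>", "<b/>", "<c/>"] * 4, 4)   # 12 elements in 4 partitions

    # map: a new Parser is built for every single element -> "Parser created" appears 12 times
    parsed = rdd.map(lambda element: Parser().parse(element)).collect()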

mapPartitions/mapPartitionsWithIndex: These two methods address that situation. They work on partitions, not on individual elements (don't get me wrong, every element is still processed), so the parser instance is created once per partition. With only 4 partitions, the parser is instantiated 4 times, 8 fewer instantiations than with map in this example. The function you pass to these methods must take an Iterator, which hands it all the elements of the current partition at once. So with mapPartitions and mapPartitionsWithIndex the parser instance is created, every element of the current partition is processed with it, and the instance is later reclaimed by GC. For setup this expensive, the change can improve the performance of your application significantly; a sketch follows below.
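And the partition-wise version, continuing the sketch above (same hypothetical Parser class and 12-element RDD):

    # mapPartitions: one Parser per partition -> "Parser created" appears only 4 times
    def parse_partition(iterator):
        parser = Parser()                        # constructed once for the whole partition
        for element in iterator:
            yield parser.parse(element)

    parsed = rdd.mapPartitions(parse_partition).collect()

    # mapPartitionsWithIndex: same idea, but the function also receives the partition index,
    # useful for logging or for treating partition 0 specially (e.g. skipping a header).
    def parse_partition_with_index(index, iterator):
        parser = Parser()
        for element in iterator:
            yield (index, parser.parse(element))

    tagged = rdd.mapPartitionsWithIndex(parse_partition_with_index).collect()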

So the bottom line is: whenever some setup work is common to all elements and can be done once and reused for all of them, it is better to go with mapPartitions/mapPartitionsWithIndex.

See the following two links for explanations with code examples:
https://bzhangusc.wordpress.com/2014/06/19/optimize-map-performamce-with-mappartitions/
http://apachesparkbook.blogspot.in/2015/11/mappartition-example.html

Answered by Mrinal