Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does the pyspark mapPartitions function work?

So I am trying to learn Spark using Python (Pyspark). I want to know how the function mapPartitions work. That is what Input it takes and what Output it gives. I couldn't find any proper example from the internet. Lets say, I have an RDD object containing lists, such as below.

[ [1, 2, 3], [3, 2, 4], [5, 2, 7] ]  

And I want to remove element 2 from all the lists, how would I achieve that using mapPartitions.

like image 916
MetallicPriest Avatar asked Nov 04 '14 17:11

MetallicPriest


People also ask

How does Apache spark mapPartitions work?

MapPartition works on a partition at a time. MapPartition returns after processing all the rows in the partition. MapPartition output is retained in memory, as it can return after processing all the rows in a particular partition.

What does mapPartitions return?

Return a new RDD by applying a function to each partition of this RDD.

What does map function do in PySpark?

In PySpark, the map (map()) is defined as the RDD transformation that is widely used to apply the transformation function (Lambda) on every element of Resilient Distributed Datasets(RDD) or DataFrame and further returns a new Resilient Distributed Dataset(RDD).


1 Answers

mapPartition should be thought of as a map operation over partitions and not over the elements of the partition. It's input is the set of current partitions its output will be another set of partitions.

The function you pass to map operation must take an individual element of your RDD

The function you pass to mapPartition must take an iterable of your RDD type and return an iterable of some other or the same type.

In your case you probably just want to do something like:

def filter_out_2(line):     return [x for x in line if x != 2]  filtered_lists = data.map(filterOut2) 

If you wanted to use mapPartition it would be:

def filter_out_2_from_partition(list_of_lists):   final_iterator = []   for sub_list in list_of_lists:     final_iterator.append( [x for x in sub_list if x != 2])   return iter(final_iterator)  filtered_lists = data.mapPartition(filterOut2FromPartion) 
like image 193
bearrito Avatar answered Oct 05 '22 23:10

bearrito