So I am trying to learn Spark using Python (PySpark). I want to know how the function mapPartitions works: what input it takes and what output it gives. I couldn't find a proper example on the internet. Let's say I have an RDD object containing lists, such as below.
[ [1, 2, 3], [3, 2, 4], [5, 2, 7] ]
And I want to remove element 2 from all the lists; how would I achieve that using mapPartitions?
mapPartitions works on one partition at a time: the function you pass receives all the rows of a partition and only returns after processing them, so a partition's output may be held in memory until the function finishes.
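A minimal sketch of that behavior, assuming a local SparkContext (the two-partition split below is just how parallelize happens to slice this sample):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Three rows spread across two partitions.
data = sc.parallelize([[1, 2, 3], [3, 2, 4], [5, 2, 7]], numSlices=2)

# The function is called once per partition and can see all of that
# partition's rows before it returns anything.
def tag_partition(rows):
    rows = list(rows)  # materializes the whole partition in memory
    yield (len(rows), rows)

print(data.mapPartitions(tag_partition).collect())
# e.g. [(1, [[1, 2, 3]]), (2, [[3, 2, 4], [5, 2, 7]])]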
As the PySpark docs put it: "Return a new RDD by applying a function to each partition of this RDD."
In PySpark, map() is the RDD transformation that applies a function (often a lambda) to every element of a Resilient Distributed Dataset (RDD) and returns a new RDD. (To use it on a DataFrame, you would first go through df.rdd.)
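A one-line illustration, assuming a live SparkContext:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# map() applies the lambda to every element of the RDD independently.
lengths = sc.parallelize([[1, 2, 3], [3, 2, 4], [5, 2, 7]]).map(lambda sub: len(sub))
print(lengths.collect())  # [3, 3, 3]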
mapPartitions should be thought of as a map operation over partitions, not over the elements of a partition. Its input is the set of current partitions and its output is another set of partitions.
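One way to see this (a sketch, again assuming a local SparkContext): the output iterator need not have the same length as the input, which is exactly what a per-partition rather than per-element operation allows.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize(range(10), numSlices=2)

# Ten input elements, but only one output element per partition:
# mapPartitions maps over partitions, so this is allowed.
def count_rows(partition):
    yield sum(1 for _ in partition)

print(rdd.mapPartitions(count_rows).collect())  # e.g. [5, 5]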
The function you pass to map must take an individual element of your RDD and return the transformed element.
The function you pass to mapPartitions must take an iterable of your RDD type and return an iterable of some other (or the same) type.
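Side by side, the two contracts look like this (a sketch; a generator expression is one convenient way to return an iterator):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize([1, 2, 3, 4])

# map: one element in, one element out.
squared = rdd.map(lambda x: x * x)

# mapPartitions: an iterator over a partition in, an iterator out.
squared_too = rdd.mapPartitions(lambda part: (x * x for x in part))

print(squared.collect())      # [1, 4, 9, 16]
print(squared_too.collect())  # [1, 4, 9, 16]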
In your case you probably just want to do something like:
def filter_out_2(line):
    return [x for x in line if x != 2]

filtered_lists = data.map(filter_out_2)
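Collecting the result shows the 2s gone from every sub-list (assuming data was built from the sample in the question):

print(filtered_lists.collect())  # [[1, 3], [3, 4], [5, 7]]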
If you wanted to use mapPartitions it would be:
def filter_out_2_from_partition(list_of_lists):
    final_iterator = []
    for sub_list in list_of_lists:
        final_iterator.append([x for x in sub_list if x != 2])
    return iter(final_iterator)

filtered_lists = data.mapPartitions(filter_out_2_from_partition)
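That version buffers each partition's output in final_iterator before returning it. A generator version (a sketch, under the same data assumption) streams rows out one at a time instead, which matters for large partitions:

def filter_out_2_from_partition(list_of_lists):
    # Yield rows one at a time instead of buffering the whole
    # partition's output in a list first.
    for sub_list in list_of_lists:
        yield [x for x in sub_list if x != 2]

filtered_lists = data.mapPartitions(filter_out_2_from_partition)
print(filtered_lists.collect())  # [[1, 3], [3, 4], [5, 7]]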