I am writing a map method using
RDD.map(lambda line: my_method(line))
and based on a particular condition in my_method (let's say the line starts with 'a'), I want to either return a particular value or ignore the item altogether.
For now, I am returning -1 when the condition is not met and later using a separate
RDD.filter() call to remove all the items that are -1.
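Simplified, what I have now looks something like this (the sample data and the -1 sentinel are just illustrative, and sc is the usual SparkContext):

def my_method(line):
    # -1 marks lines that should be dropped later
    return line.lower() if line.startswith("a") else -1

rdd = sc.parallelize(["aDSd", "CDd", "aCVED"])
rdd.map(lambda line: my_method(line)).filter(lambda x: x != -1).collect()
## ['adsd', 'acved']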
Is there a better way to ignore these items, for example by returning null from my_method?
In a case like this, flatMap is your friend:
Adjust my_method so it returns either a single-element list or an empty list (or create a wrapper, as in What is the equivalent to scala.util.Try in pyspark?):
def my_method(line):
    return [line.lower()] if line.startswith("a") else []
Then flatMap over the RDD:
rdd = sc.parallelize(["aDSd", "CDd", "aCVED"])
rdd.flatMap(lambda line: my_method(line)).collect()
## ['adsd', 'acved']
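flatMap flattens whatever iterable my_method returns, so the empty lists simply vanish from the result and no separate filter pass is needed.
For the Try-style wrapper mentioned above, a rough sketch could look like this (the try_or_empty helper and the parsing example are just illustrative, not an existing API):

def try_or_empty(f):
    # Wrap f so a successful call yields a one-element list
    # and any exception yields an empty list, which flatMap then drops.
    def wrapped(x):
        try:
            return [f(x)]
        except Exception:
            return []
    return wrapped

numbers = sc.parallelize(["1", "not a number", "3"])
numbers.flatMap(try_or_empty(int)).collect()
## [1, 3]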