 

Pyspark: Replacing value in a column by searching a dictionary

I'm a newbie in PySpark.

I have a Spark DataFrame df that has a column 'device_type'.

I want to replace every value that is either "Tablet" or "Phone" with "Mobile", and replace "PC" with "Desktop".

In pandas I can do the following:

deviceDict = {'Tablet': 'Mobile', 'Phone': 'Mobile', 'PC': 'Desktop'}
df['device_type'] = df['device_type'].replace(deviceDict, inplace=False)

How can I achieve this using PySpark? Thanks!

Asked May 15 '17 by Yuehan Lyu

People also ask

How do you replace values in a column in PySpark?

You can replace column values of a PySpark DataFrame by using the SQL string functions regexp_replace(), translate(), and overlay().
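For example, a minimal regexp_replace() sketch (assuming a DataFrame df with a 'device_type' column, as in the question above):

from pyspark.sql.functions import regexp_replace

# replace the literal value 'PC' with 'Desktop' via a regex match
df = df.withColumn('device_type', regexp_replace('device_type', '^PC$', 'Desktop'))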

Can I use dictionary in PySpark?

PySpark doesn't have a dictionary type; instead it uses MapType to store dictionary-like objects, and a DataFrame column of MapType can be declared with a schema built from pyspark.sql.types.StructType.
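For example, a minimal sketch of a DataFrame with a MapType column (the column names here are made up for illustration):

from pyspark.sql.types import StructType, StructField, StringType, MapType

schema = StructType([
    StructField('name', StringType(), True),
    # MapType stores dictionary-like key/value pairs in a single column
    StructField('properties', MapType(StringType(), StringType()), True)
])
df = spark.createDataFrame([('laptop', {'os': 'linux'})], schema)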

What does .collect do in PySpark?

collect() is an operation on an RDD or DataFrame that retrieves the data to the driver. It is useful for gathering all the elements of every partition and bringing them to the driver node/program.
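For example, a minimal sketch (assuming the df from the question):

# collect() brings every row back to the driver as a list of Row objects,
# so it should only be used on data small enough to fit in driver memory
rows = df.collect()
for row in rows:
    print(row['device_type'])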

How do I update a column in spark?

Spark's withColumn() DataFrame function is used to update the value of a column. It takes two arguments: first the name of the column you want to update, and second the value (a Column expression) to update it with.
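For example, a minimal withColumn() sketch (assuming the df from the question):

from pyspark.sql.functions import when, col

# overwrite 'device_type' in place: map 'PC' to 'Desktop', keep everything else
df = df.withColumn(
    'device_type',
    when(col('device_type') == 'PC', 'Desktop').otherwise(col('device_type'))
)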


1 Answer

You can use either na.replace:

# the mapping from the question
deviceDict = {'Tablet': 'Mobile', 'Phone': 'Mobile', 'PC': 'Desktop'}

df = spark.createDataFrame([
    ('Tablet', ), ('Phone', ), ('PC', ), ('Other', ), (None, )
], ["device_type"])

# the second argument is ignored when a dict is passed as to_replace,
# but older Spark versions require a placeholder value here
df.na.replace(deviceDict, 1).show()
+-----------+
|device_type|
+-----------+
|     Mobile|
|     Mobile|
|    Desktop|
|      Other|
|       null|
+-----------+
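As a side note, the 1 passed above is just a placeholder; it is ignored when a dict is given as to_replace, and in more recent PySpark versions (assuming yours is one of them) it can be omitted entirely:

df.na.replace(deviceDict).show()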

or a map literal:

from itertools import chain
from pyspark.sql.functions import create_map, lit

# build a MapType literal from the dictionary: key1, value1, key2, value2, ...
mapping = create_map([lit(x) for x in chain(*deviceDict.items())])

df.select(mapping[df['device_type']].alias('device_type')).show()
+-----------+
|device_type|
+-----------+
|     Mobile|
|     Mobile|
|    Desktop|
|       null|
|       null|
+-----------+

Please note that the latter solution converts values not present in the mapping to NULL. If that is not the desired behavior, you can add coalesce:

from pyspark.sql.functions import coalesce

# fall back to the original value when the key is not in the mapping
df.select(
    coalesce(mapping[df['device_type']], df['device_type']).alias('device_type')
).show()
+-----------+
|device_type|
+-----------+
|     Mobile|
|     Mobile|
|    Desktop|
|      Other|
|       null|
+-----------+
Answered Oct 12 '22 by zero323