Using Spark 1.6, I have a Spark DataFrame column (named, let's say, col1) with values A, B, C, DS, DNS, E, F, G and H, and I want to create a new column (say col2) with the values from the dict below. How do I map this? (E.g. 'A' needs to be mapped to 'S', etc.)
dict = {'A': 'S', 'B': 'S', 'C': 'S', 'DS': 'S', 'DNS': 'S', 'E': 'NS', 'F': 'NS', 'G': 'NS', 'H': 'NS'}
Solution: The PySpark SQL function create_map() is used to convert selected DataFrame columns to MapType. create_map() takes as an argument a list of columns (alternating keys and values) and returns a MapType column.
In PySpark, to add a new column with a constant value to a DataFrame, use the lit() function (from pyspark.sql.functions import lit). lit() takes the constant value you want to add and returns a Column; to add a NULL / None, use lit(None).
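A minimal sketch of lit() in isolation (assuming an existing DataFrame df; the column names here are illustrative):

from pyspark.sql.functions import lit

df.withColumn("flag", lit("S"))                   # constant string column
df.withColumn("empty", lit(None).cast("string"))  # typed NULL column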
PySpark doesn't have a dictionary type; instead it uses MapType to store dictionary objects. Below is an example of creating a DataFrame with a MapType column using pyspark.sql.types.MapType inside a StructType schema.
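A sketch of such a schema, assuming a SparkSession named spark (i.e. Spark >= 2.0); the field names are illustrative:

from pyspark.sql.types import StructType, StructField, StringType, MapType

schema = StructType([
    StructField("key", StringType()),
    StructField("properties", MapType(StringType(), StringType()))
])
# each Python dict becomes a MapType value in the column
df_map = spark.createDataFrame([("A", {"A": "S"})], schema)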
Inefficient solution with UDF (version independent):
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

def translate(mapping):
    def translate_(col):
        # dict.get returns None for keys missing from the mapping
        return mapping.get(col)
    return udf(translate_, StringType())

df = sc.parallelize([('DS', ), ('G', ), ('INVALID', )]).toDF(['key'])

mapping = {
    'A': 'S', 'B': 'S', 'C': 'S', 'DS': 'S', 'DNS': 'S',
    'E': 'NS', 'F': 'NS', 'G': 'NS', 'H': 'NS'}

df.withColumn("value", translate(mapping)("key"))
with the result:
+-------+-----+
|    key|value|
+-------+-----+
|     DS|    S|
|      G|   NS|
|INVALID| null|
+-------+-----+
Much more efficient (Spark >= 2.0, Spark < 3.0) is to create a MapType literal:
from pyspark.sql.functions import col, create_map, lit
from itertools import chain

# flatten the dict into [key1, value1, key2, value2, ...] column literals
mapping_expr = create_map([lit(x) for x in chain(*mapping.items())])

df.withColumn("value", mapping_expr.getItem(col("key")))
with the same result:
+-------+-----+
|    key|value|
+-------+-----+
|     DS|    S|
|      G|   NS|
|INVALID| null|
+-------+-----+
but more efficient execution plan:
== Physical Plan ==
*Project [key#15, keys: [B,DNS,DS,F,E,H,C,G,A], values: [S,S,S,NS,NS,NS,S,NS,S][key#15] AS value#53]
+- Scan ExistingRDD[key#15]
compared to the UDF version, which has to ship every row to a Python worker and back:
== Physical Plan ==
*Project [key#15, pythonUDF0#61 AS value#57]
+- BatchEvalPython [translate_(key#15)], [key#15, pythonUDF0#61]
   +- Scan ExistingRDD[key#15]
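You can reproduce these plans yourself with DataFrame.explain(), using the df, mapping, and mapping_expr defined above (the exact plan text varies by Spark version):

df.withColumn("value", translate(mapping)("key")).explain()
df.withColumn("value", mapping_expr.getItem(col("key"))).explain()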
In Spark >= 3.0, getItem should be replaced with __getitem__ ([]), i.e.:
df.withColumn("value", mapping_expr[col("key")]).show()
Sounds like the simplest solution would be to use the replace function: http://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace
mapping = {'A': '1', 'B': '2'}
df2 = df.replace(to_replace=mapping, subset=['yourColName'])
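Note that, unlike the MapType lookup, replace() leaves values that are absent from the mapping unchanged rather than turning them into null. A quick sketch against the example df from above, reusing the original 'A' -> 'S' mapping:

mapping = {'A': 'S', 'B': 'S', 'C': 'S', 'DS': 'S', 'DNS': 'S',
           'E': 'NS', 'F': 'NS', 'G': 'NS', 'H': 'NS'}
df.replace(to_replace=mapping, subset=['key']).show()
# 'INVALID' is not a key in the mapping, so it stays 'INVALID' (not null)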