I got stuck on a data transformation task in PySpark. I want to replace all values of one column in a DataFrame with the key-value pairs specified in a dictionary.
dict = {'A':1, 'B':2, 'C':3}
My df looks like this:
+----+----+
|col1|col2|
+----+----+
|   B|   A|
|   A|   A|
|   A|   A|
|   C|   B|
|   A|   A|
+----+----+
Now I want to replace all values of col1 with the key-value pairs defined in dict.
Desired Output:
+----+----+
|col1|col2|
+----+----+
|   2|   A|
|   1|   A|
|   1|   A|
|   3|   B|
|   1|   A|
+----+----+
I tried
df.na.replace(dict, 1).show()
but that also replaces the values in col2, which should stay untouched.
Thank you for your help. Greetings :)
Your data:
print(df)
DataFrame[col1: string, col2: string]
df.show()
+----+----+
|col1|col2|
+----+----+
| B| A|
| A| A|
| A| A|
| C| B|
| A| A|
+----+----+
diz = {"A":1, "B":2, "C":3}
Convert the values of your dictionary from integer to string, so that the replacement does not fail on mismatched types:
diz = {k: str(v) for k, v in diz.items()}
print(diz)
{'A': '1', 'C': '3', 'B': '2'}
Replace the values of col1. Note that when to_replace is a dictionary, the value argument of na.replace is ignored, so it is enough to pass the subset:
df2 = df.na.replace(diz, subset=["col1"])
print(df2)
DataFrame[col1: string, col2: string]
df2.show()
+----+----+
|col1|col2|
+----+----+
| 2| A|
| 1| A|
| 1| A|
| 3| B|
| 1| A|
+----+----+
If you need to cast the values from string to integer:
from pyspark.sql.types import IntegerType
df3 = df2.select(df2["col1"].cast(IntegerType()), df2["col2"])
print(df3)
DataFrame[col1: int, col2: string]
df3.show()
+----+----+
|col1|col2|
+----+----+
| 2| A|
| 1| A|
| 1| A|
| 3| B|
| 1| A|
+----+----+