pyspark/dataframe: replace null with empty space

I have the following udf in a pyspark dataframe. The code works fine, except that when myFun1('oldColumn') is null I want the output to be an empty string instead of null.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
myFun1 = udf(lambda x: myModule.myFunction1(x), StringType())
myDF = myDF.withColumn('newColumn', myFun1('oldColumn'))

Is it possible to do this in place instead of creating another udf function? Thanks!

asked Feb 06 '23 by Edamame

1 Answer

Using df.fillna() or df.na.fill() to replace null values with an empty string worked for me.

You can do replacements per column by passing a dict that maps each column name to the value its nulls should be replaced with:

myDF = myDF.na.fill({'oldColumn': ''})
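
In the question's setup, the same idea works on the column produced by the udf. A minimal sketch, reusing myFun1 from the question, so that any null the udf returns ends up as an empty string in newColumn:

myDF = myDF.withColumn('newColumn', myFun1('oldColumn')).na.fill({'newColumn': ''})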

The PySpark docs have an example:

>>> df4.na.fill({'age': 50, 'name': 'unknown'}).show()
+---+------+-------+
|age|height|   name|
+---+------+-------+
| 10|    80|  Alice|
|  5|  null|    Bob|
| 50|  null|    Tom|
| 50|  null|unknown|
+---+------+-------+
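
For reference, here is a self-contained sketch that reproduces an output like the one above; the DataFrame is constructed by hand just for illustration, assuming a local SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Example data with nulls in both a numeric and a string column
df4 = spark.createDataFrame(
    [(10, 80, 'Alice'), (5, None, 'Bob'), (None, None, 'Tom'), (None, None, None)],
    'age INT, height INT, name STRING')

# Fill nulls per column: 50 for 'age', 'unknown' for 'name' ('height' is left untouched)
df4.na.fill({'age': 50, 'name': 'unknown'}).show()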
answered Feb 11 '23 by scmz