Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pyspark replace NaN with NULL

I use Spark to perform data transformations that I load into Redshift. Redshift does not support NaN values, so I need to replace all occurrences of NaN with NULL.

I tried something like this:

some_table = sql('SELECT * FROM some_table')
some_table = some_table.na.fill(None)

But I got the following error:

ValueError: value should be a float, int, long, string, bool or dict

So it seems like na.fill() doesn't support None. I specifically need to replace with NULL, not some other value, like 0.

like image 353
user554481 Avatar asked Jun 22 '18 17:06

user554481


People also ask

How do I change NaN to null in PySpark?

You can use the . replace function to change to null values in one line of code.

How do I remove NaN values from PySpark?

In order to remove Rows with NULL values on selected columns of PySpark DataFrame, use drop(columns:Seq[String]) or drop(columns:Array[String]). To these functions pass the names of the columns you wanted to check for NULL values to delete rows.

How do I replace a string value with a null in PySpark?

In order to replace empty value with None/null on single DataFrame column, you can use withColumn() and when(). otherwise() function.

How do I specify null in PySpark?

In PySpark, using filter() or where() functions of DataFrame we can filter rows with NULL values by checking isNULL() of PySpark Column class. The above statements return all rows that have null values on the state column and the result is returned as the new DataFrame. All the above examples return the same output.


1 Answers

I finally found the answer after Googling around a bit.

df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
df.show()

+----+---+
|   a|  b|
+----+---+
|   1|NaN|
|null|1.0|
+----+---+

import pyspark.sql.functions as F
columns = df.columns
for column in columns:
    df = df.withColumn(column,F.when(F.isnan(F.col(column)),None).otherwise(F.col(column)))

sqlContext.registerDataFrameAsTable(df, "df2")
sql('select * from df2').show()

+----+----+
|   a|   b|
+----+----+
|   1|null|
|null| 1.0|
+----+----+

It doesn't use na.fill(), but it accomplished the same result, so I'm happy.

like image 148
user554481 Avatar answered Nov 05 '22 23:11

user554481