I use Spark to perform data transformations that I load into Redshift. Redshift does not support NaN values, so I need to replace all occurrences of NaN with NULL.
I tried something like this:
some_table = spark.sql('SELECT * FROM some_table')
some_table = some_table.na.fill(None)
But I got the following error:
ValueError: value should be a float, int, long, string, bool or dict
So it seems like na.fill() doesn't support None. I specifically need to replace with NULL, not some other value like 0.
You can use the .replace() function to change NaN to null values in one line of code.
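A minimal sketch, assuming a recent PySpark version in which DataFrame.replace() accepts None as the replacement value:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, float('nan')), (float('nan'), 2.0)], ("a", "b"))

# replace() with value=None turns every NaN in the DataFrame into null
df = df.replace(float('nan'), None)
df.show()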
To remove rows with NULL values in selected columns of a PySpark DataFrame, use na.drop() and pass the names of the columns you want to check for NULL values via the subset parameter.
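For example, continuing with the df from the sketch above, this drops every row that has a null in either column:

# drop rows that have a null in column "a" or column "b"
df.na.drop(subset=["a", "b"]).show()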
To replace an empty value with None/null in a single DataFrame column, you can use withColumn() together with the when().otherwise() functions.
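A sketch, reusing the spark session from above and a hypothetical string column called "name":

import pyspark.sql.functions as F

str_df = spark.createDataFrame([("alice",), ("",)], ("name",))

# when() picks out empty strings; otherwise() keeps the original value
str_df = str_df.withColumn(
    "name", F.when(F.col("name") == "", None).otherwise(F.col("name"))
)
str_df.show()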
In PySpark, using the filter() or where() functions of a DataFrame, we can filter rows with NULL values by checking isNull() on the PySpark Column class. Such a statement returns all rows that have null values in the checked column as a new DataFrame; filter() and where() produce the same output.
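A sketch, again using the df and the F import from above:

# both lines return the rows where column "b" is null; filter() and
# where() are aliases of each other
df.filter(F.col("b").isNull()).show()
df.where(df.b.isNull()).show()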
I finally found the answer after Googling around a bit.
df = spark.createDataFrame([(1, float('nan')), (None, 1.0)], ("a", "b"))
df.show()
+----+---+
| a| b|
+----+---+
| 1|NaN|
|null|1.0|
+----+---+
import pyspark.sql.functions as F

columns = df.columns
for column in columns:
    # NaN becomes null; every other value is passed through unchanged
    df = df.withColumn(column, F.when(F.isnan(F.col(column)), None).otherwise(F.col(column)))

df.createOrReplaceTempView("df2")
spark.sql('select * from df2').show()
+----+----+
| a| b|
+----+----+
| 1|null|
|null| 1.0|
+----+----+
It doesn't use na.fill(), but it accomplished the same result, so I'm happy.
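For what it's worth, the same per-column rewrite can be expressed as a single select() instead of a withColumn() loop; a sketch, assuming every column is numeric so that isnan() applies:

# one pass over all columns, equivalent to the loop above
df = df.select(
    [F.when(F.isnan(F.col(c)), None).otherwise(F.col(c)).alias(c) for c in df.columns]
)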