Replace empty strings with None/null values in DataFrame

I have a Spark 1.5.0 DataFrame with a mix of null and empty strings in the same column. I want to convert all empty strings in all columns to null (None, in Python). The DataFrame may have hundreds of columns, so I'm trying to avoid hard-coded manipulations of each column.

See my attempt below, which results in an error.

    from pyspark.sql import Row, SQLContext
    sqlContext = SQLContext(sc)

    ## Create a test DataFrame
    testDF = sqlContext.createDataFrame([Row(col1='foo', col2=1), Row(col1='', col2=2), Row(col1=None, col2='')])
    testDF.show()
    ## +----+----+
    ## |col1|col2|
    ## +----+----+
    ## | foo|   1|
    ## |    |   2|
    ## |null|null|
    ## +----+----+

    ## Try to replace an empty string with None/null
    testDF.replace('', None).show()
    ## ValueError: value should be a float, int, long, string, list, or tuple

    ## A string value of null (obviously) doesn't work...
    testDF.replace('', 'null').na.drop(subset='col1').show()
    ## +----+----+
    ## |col1|col2|
    ## +----+----+
    ## | foo|   1|
    ## |null|   2|
    ## +----+----+

asked Oct 22 '15 18:10

dnlbrky

1 Answer

It is as simple as this:

    from pyspark.sql.functions import col, when

    def blank_as_null(x):
        return when(col(x) != "", col(x)).otherwise(None)

    dfWithEmptyReplaced = testDF.withColumn("col1", blank_as_null("col1"))

    dfWithEmptyReplaced.show()
    ## +----+----+
    ## |col1|col2|
    ## +----+----+
    ## | foo|   1|
    ## |null|   2|
    ## |null|null|
    ## +----+----+

    dfWithEmptyReplaced.na.drop().show()
    ## +----+----+
    ## |col1|col2|
    ## +----+----+
    ## | foo|   1|
    ## +----+----+

If you want to convert multiple columns you can, for example, use reduce:

    from functools import reduce  # needed on Python 3; reduce is a builtin on Python 2

    to_convert = set([...]) # Some set of columns

    reduce(lambda df, x: df.withColumn(x, blank_as_null(x)), to_convert, testDF)

or use a list comprehension:

    exprs = [
        blank_as_null(x).alias(x) if x in to_convert else x for x in testDF.columns]

    testDF.select(*exprs)

If you want to operate specifically on string fields, please check the answer by robin-loxley.
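For the "hundreds of columns" case in the question, one possible way to avoid listing columns by hand is to pick the string columns from `DataFrame.dtypes` and apply the same `when(...).otherwise(None)` expression to each of them. The following is only a minimal sketch under that assumption; the helper names `string_columns` and `blank_strings_as_null` are made up for illustration and are not part of PySpark or of the referenced answer.

```python
def string_columns(dtypes):
    """Return the names of columns whose Spark SQL type is 'string'.

    `dtypes` is a list of (name, type) pairs, the shape of DataFrame.dtypes.
    Hypothetical helper, not a PySpark function.
    """
    return [name for name, dtype in dtypes if dtype == "string"]


def blank_strings_as_null(df):
    """Return a new DataFrame with "" replaced by null in every string column.

    Hypothetical helper built on the when/otherwise pattern shown above.
    """
    # Imported here so that string_columns() stays usable without Spark installed.
    from pyspark.sql.functions import col, when

    targets = set(string_columns(df.dtypes))
    exprs = [
        # Keep non-empty values; empty strings (and existing nulls) become null.
        when(col(c) != "", col(c)).otherwise(None).alias(c) if c in targets else col(c)
        for c in df.columns
    ]
    return df.select(*exprs)
```

With the test DataFrame from the question, this would be called as `blank_strings_as_null(testDF).show()`. Non-string columns are passed through unchanged, so the comparison with `""` is never applied to numeric data.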

answered Sep 27 '22 18:09

zero323

Tags: python, dataframe, apache-spark, apache-spark-sql, pyspark
