
Replace empty strings with None/null values in DataFrame

I have a Spark 1.5.0 DataFrame with a mix of null and empty strings in the same column. I want to convert all empty strings in all columns to null (None, in Python). The DataFrame may have hundreds of columns, so I'm trying to avoid hard-coded manipulations of each column.

See my attempt below, which results in an error.

    from pyspark.sql import Row, SQLContext
    sqlContext = SQLContext(sc)

    ## Create a test DataFrame
    testDF = sqlContext.createDataFrame([Row(col1='foo', col2=1), Row(col1='', col2=2), Row(col1=None, col2='')])
    testDF.show()
    ## +----+----+
    ## |col1|col2|
    ## +----+----+
    ## | foo|   1|
    ## |    |   2|
    ## |null|null|
    ## +----+----+

    ## Try to replace an empty string with None/null
    testDF.replace('', None).show()
    ## ValueError: value should be a float, int, long, string, list, or tuple

    ## A string value of null (obviously) doesn't work...
    testDF.replace('', 'null').na.drop(subset='col1').show()
    ## +----+----+
    ## |col1|col2|
    ## +----+----+
    ## | foo|   1|
    ## |null|   2|
    ## +----+----+
dnlbrky asked Oct 22 '15 18:10





1 Answer

It is as simple as this:

    from pyspark.sql.functions import col, when

    def blank_as_null(x):
        return when(col(x) != "", col(x)).otherwise(None)

    dfWithEmptyReplaced = testDF.withColumn("col1", blank_as_null("col1"))

    dfWithEmptyReplaced.show()
    ## +----+----+
    ## |col1|col2|
    ## +----+----+
    ## | foo|   1|
    ## |null|   2|
    ## |null|null|
    ## +----+----+

    dfWithEmptyReplaced.na.drop().show()
    ## +----+----+
    ## |col1|col2|
    ## +----+----+
    ## | foo|   1|
    ## +----+----+

If you want to convert multiple columns, you can, for example, use reduce:

    from functools import reduce  # reduce is a builtin in Python 2; Python 3 needs this import

    to_convert = set([...])  # Some set of columns

    reduce(lambda df, x: df.withColumn(x, blank_as_null(x)), to_convert, testDF)

or use a list comprehension:

    exprs = [
        blank_as_null(x).alias(x) if x in to_convert else x
        for x in testDF.columns]

    testDF.select(*exprs)

If you want to operate specifically on string fields, please check the answer by robin-loxley.
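As a rough illustration of limiting the conversion to string fields (a minimal sketch, not a reproduction of robin-loxley's answer): DataFrame.dtypes lists (column name, type name) pairs, and string columns carry the type name "string", so those names can be used as the set of columns to convert. The helper names string_columns and blank_strings_as_null are hypothetical, and the sketch assumes column names without dots or other characters that need escaping.

```python
def string_columns(dtypes):
    """Return the names of string-typed columns, given DataFrame.dtypes pairs."""
    return [name for name, dtype in dtypes if dtype == "string"]


def blank_strings_as_null(df):
    """Replace "" with null in every string column; other columns pass through."""
    # Imported here so that string_columns above stays usable without Spark.
    from pyspark.sql.functions import col, when

    to_convert = set(string_columns(df.dtypes))
    exprs = [
        when(col(x) != "", col(x)).otherwise(None).alias(x)
        if x in to_convert else col(x)
        for x in df.columns]
    return df.select(*exprs)
```

With the test DataFrame from the question, this would be called as blank_strings_as_null(testDF). Deriving the column set from dtypes avoids hard-coding names, which matters when the DataFrame has hundreds of columns.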

zero323 answered Sep 27 '22 18:09
