I have a Spark 1.5.0 DataFrame with a mix of null and empty strings in the same column. I want to convert all empty strings in all columns to null (None, in Python). The DataFrame may have hundreds of columns, so I'm trying to avoid hard-coded manipulations of each column.

See my attempt below, which results in an error.
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)

## Create a test DataFrame
testDF = sqlContext.createDataFrame([Row(col1='foo', col2=1),
                                     Row(col1='', col2=2),
                                     Row(col1=None, col2='')])
testDF.show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## |    |   2|
## |null|null|
## +----+----+

## Try to replace an empty string with None/null
testDF.replace('', None).show()
## ValueError: value should be a float, int, long, string, list, or tuple

## A string value of null (obviously) doesn't work...
testDF.replace('', 'null').na.drop(subset='col1').show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## |null|   2|
## +----+----+
It is as simple as this:
from pyspark.sql.functions import col, when

def blank_as_null(x):
    return when(col(x) != "", col(x)).otherwise(None)

dfWithEmptyReplaced = testDF.withColumn("col1", blank_as_null("col1"))
dfWithEmptyReplaced.show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## |null|   2|
## |null|null|
## +----+----+

dfWithEmptyReplaced.na.drop().show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## +----+----+
If you want to fill multiple columns you can, for example, use reduce:

from functools import reduce  # not needed in Python 2

to_convert = set([...])  # some set of columns

reduce(lambda df, x: df.withColumn(x, blank_as_null(x)), to_convert, testDF)
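The fold above can be sketched in plain Python (no Spark required) to show how reduce threads the accumulator through one transformation per column name; the dict here merely stands in for a DataFrame row, and blank_to_none is a hypothetical helper, not part of the original answer:

```python
from functools import reduce

# Plain-Python sketch of the fold: each step rewrites one key of a dict,
# mirroring how reduce applies withColumn once per column name.
def blank_to_none(record, key):
    value = record[key]
    return {**record, key: None if value == "" else value}

to_convert = ["col1", "col2"]
row = {"col1": "", "col2": "x"}
result = reduce(blank_to_none, to_convert, row)
# result == {"col1": None, "col2": "x"}
```

The accumulator starts as the original row and each column name in to_convert produces one updated copy, exactly as each withColumn call produces a new DataFrame.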
or use a comprehension:

exprs = [
    blank_as_null(x).alias(x) if x in to_convert else x
    for x in testDF.columns
]

testDF.select(*exprs)
If you want to operate specifically on string fields, please check the answer by robin-loxley.
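If you would rather build to_convert from the string-typed columns yourself, one approach (an assumption of mine, not taken from the original answer) is to filter df.dtypes, which returns (name, type) pairs. A minimal sketch of that selection step, using a dtypes list shaped like what testDF.dtypes would return:

```python
# Sketch: pick only string-typed columns from a dtypes list such as
# [('col1', 'string'), ('col2', 'bigint')] -- the shape df.dtypes returns.
# string_columns is a hypothetical helper, not a Spark API.
def string_columns(dtypes):
    return [name for name, dtype in dtypes if dtype == 'string']

to_convert = set(string_columns([('col1', 'string'), ('col2', 'bigint')]))
# to_convert == {'col1'}
```

The resulting set can then be fed into either the reduce version or the comprehension version above.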