Removing Blank Strings from a Spark Dataframe

I'm attempting to remove rows in which a Spark DataFrame column contains blank strings. I originally did val df2 = df1.na.drop(), but it turns out many of these values are being encoded as "".

I'm stuck on Spark 1.3.1 and also cannot rely on the DSL. (Importing spark.implicits._ isn't working.)

asked Oct 10 '16 by mongolol

People also ask

How do I remove blank lines in Spark?

To remove rows with NULL values in selected columns of a Spark DataFrame, use drop(columns: Seq[String]) or drop(columns: Array[String]), passing the names of the columns you want to check for NULL values.

How do you filter blanks in PySpark?

In PySpark, you can filter rows with NULL values using the filter() or where() functions of DataFrame, checking isNull() on the PySpark Column class. Either form returns the rows that have null values in the checked column as a new DataFrame, and both produce the same output.

How do you replace blank values in PySpark?

In a PySpark DataFrame, use the when().otherwise() SQL functions to detect whether a column has an empty value, and use the withColumn() transformation to replace the value of an existing column.

How do you handle NULL values in Spark?

In Spark, the fill() function of the DataFrameNaFunctions class replaces NULL values in a DataFrame column with zero (0), an empty string, a space, or any other constant literal value.


2 Answers

Removing rows from a DataFrame requires filter().

newDF = oldDF.filter("colName != ''")

or am I misunderstanding your question?

answered Oct 08 '22 by Kristian

In case someone doesn't want to drop the records with blank strings, but instead wants to convert the blank strings to some constant value:

val newdf = df.na.replace(df.columns, Map("" -> "0")) // replace blank strings with the string "0"
newdf.show()
answered Oct 08 '22 by Gaurav Khare