
Difference between na().drop() and filter(col.isNotNull) (Apache Spark)

Is there any difference in semantics between df.na().drop() and df.filter(df.col("onlyColumnInOneColumnDataFrame").isNotNull() && !df.col("onlyColumnInOneColumnDataFrame").isNaN()), where df is an Apache Spark DataFrame?

Or should I consider it a bug if the first one leaves no null (an actual null value, not the string "null") in the column onlyColumnInOneColumnDataFrame while the second one does?

EDIT: added !isNaN() as well. onlyColumnInOneColumnDataFrame is the only column in the given DataFrame. Let's say its type is Integer.

JiriS asked Feb 18 '16 09:02




1 Answer

With df.na.drop() you drop the rows containing any null or NaN values.

With df.filter(df.col("onlyColumnInOneColumnDataFrame").isNotNull()) you drop those rows which have null only in the column onlyColumnInOneColumnDataFrame.

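To get the same result with na.drop, restrict it to that column: df.na.drop(Seq("onlyColumnInOneColumnDataFrame")) in Scala, or df.na.drop(subset=["onlyColumnInOneColumnDataFrame"]) in PySpark. This form also drops NaN values, which can only occur in float and double columns.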
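
To make the difference concrete, here is a minimal sketch that models the two behaviours in plain Python, without Spark. The helpers is_missing, na_drop and filter_not_null and the columns a and b are hypothetical names made up for illustration. They only mimic the row-dropping rules described above and are not Spark's API.

```python
import math


def is_missing(value):
    """True for None (null) and float NaN, the two cases na.drop treats as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def na_drop(rows, subset=None):
    """Hypothetical model of df.na.drop(): keep rows with no null/NaN
    in the checked columns. With subset=None every column is checked."""
    def keep(row):
        cols = row.keys() if subset is None else subset
        return not any(is_missing(row[c]) for c in cols)
    return [row for row in rows if keep(row)]


def filter_not_null(rows, column):
    """Hypothetical model of df.filter(col.isNotNull()): only nulls
    in the one named column are checked."""
    return [row for row in rows if row[column] is not None]


# Two-column toy data (assumed for illustration): integer column "a", double column "b".
rows = [
    {"a": 1, "b": 1.0},
    {"a": None, "b": 2.0},
    {"a": 3, "b": None},
    {"a": 4, "b": float("nan")},
]

all_cols = na_drop(rows)                 # checks both "a" and "b"
only_a = na_drop(rows, subset=["a"])     # checks "a" only
not_null_a = filter_not_null(rows, "a")  # checks "a" only, nulls only

print([r["a"] for r in all_cols])    # [1]
print([r["a"] for r in only_a])      # [1, 3, 4]
print([r["a"] for r in not_null_a])  # [1, 3, 4]
```

In this model the unrestricted drop removes every row that has a missing value in any column, while the restricted drop and the single-column filter agree. On a DataFrame with a single Integer column, as in the question, all three would keep the same rows.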

Daniel Zolnai answered Oct 11 '22 20:10