Is there any difference in semantics between df.na().drop() and df.filter(df.col("onlyColumnInOneColumnDataFrame").isNotNull() && !df.col("onlyColumnInOneColumnDataFrame").isNaN()), where df is an Apache Spark DataFrame?

Or should I consider it a bug if the first one afterwards does NOT return null (not a String "null", but simply a null value) in the column onlyColumnInOneColumnDataFrame, while the second one does?

EDIT: added !isNaN() as well. onlyColumnInOneColumnDataFrame is the only column in the given DataFrame; let's say its type is Integer.
In Spark, the filter function returns a new dataset formed by selecting those elements of the source for which the given condition returns true; that is, it retrieves only the rows that satisfy the condition.
Both filter and where in Spark SQL give the same result; there is no difference between the two. filter is simply the standard Scala name for such a function, while where exists for people who prefer SQL.
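For illustration, here is a minimal Scala sketch; the local SparkSession, the sample data, and the comparison condition are assumptions made up only for this example:

    import org.apache.spark.sql.SparkSession

    // Hypothetical local session, only for this sketch.
    val spark = SparkSession.builder()
      .appName("filter-vs-where")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // One-column DataFrame of nullable integers.
    val numbers = Seq(Some(1), Some(2), None).toDF("onlyColumnInOneColumnDataFrame")

    // 'where' is an alias for 'filter'; both return the same rows.
    val viaFilter = numbers.filter(numbers.col("onlyColumnInOneColumnDataFrame") > 1)
    val viaWhere  = numbers.where(numbers.col("onlyColumnInOneColumnDataFrame") > 1)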
To remove rows with NULL values in selected columns of a Spark DataFrame, use na.drop with a Seq[String] or Array[String] of column names, passing the names of the columns you want to check for NULL values.
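A small sketch of that, reusing the session and implicits from the example above; the columns name and state and the sample rows are hypothetical:

    // Drop rows that have a null in the "state" column only.
    val people = Seq(("Alice", Some("CA")), ("Bob", None)).toDF("name", "state")

    // Keeps only the ("Alice", "CA") row: rows with a null "state" are dropped,
    // while nulls in other columns would be left alone.
    val cleaned = people.na.drop(Seq("state"))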
In Spark, using the filter() or where() functions of DataFrame, we can drop rows with NULL values by checking the column with isNotNull (or IS NOT NULL in a SQL expression). This removes all rows with a null value in the state column and returns a new DataFrame; all of these approaches produce the same output.
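Continuing the same hypothetical people DataFrame, the two styles could look like this:

    // Two equivalent ways to drop the rows where "state" is null, matching
    // people.na.drop(Seq("state")) from the sketch above.
    val viaColumn  = people.filter(people.col("state").isNotNull)
    val viaSqlExpr = people.where("state IS NOT NULL")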
With df.na.drop() you drop the rows containing any null or NaN value in any column.
With df.filter(df.col("onlyColumnInOneColumnDataFrame").isNotNull()) you drop only those rows that have a null in the column onlyColumnInOneColumnDataFrame.
If you want to achieve the same thing as the filter, that would be df.na.drop(Seq("onlyColumnInOneColumnDataFrame")).
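To make the distinction concrete, here is a minimal, self-contained sketch; the local session, the sample rows, and the extra column other are assumptions added only so the two behaviours can actually differ:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical two-column DataFrame; "other" exists only so the whole-row
    // and per-column variants give different results.
    val sample = Seq(
      (Some(1), Some(10)),
      (None,    Some(20)),  // null only in onlyColumnInOneColumnDataFrame
      (Some(3), None)       // null only in other
    ).toDF("onlyColumnInOneColumnDataFrame", "other")

    // Removes rows 2 and 3: any null (or NaN) in any column drops the row.
    val droppedAll = sample.na.drop()

    // Removes only row 2: just the named column is checked.
    val droppedOne = sample.na.drop(Seq("onlyColumnInOneColumnDataFrame"))

    // Keeps the rows where the named column is non-null; for an Integer column
    // this matches droppedOne, since NaN only exists for floating-point types.
    val filtered = sample.filter(sample.col("onlyColumnInOneColumnDataFrame").isNotNull)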