What's the difference between selecting with a where clause and filtering in Spark?
Are there any use cases in which one is more appropriate than the other one?
When do I use
DataFrame newdf = df.select(df.col("*")).where(df.col("somecol").leq(10))
and when is
DataFrame newdf = df.select(df.col("*")).filter("somecol <= 10")
more appropriate?
Both 'filter' and 'where' in Spark SQL give the same result; there is no difference between the two. 'filter' is simply the standard Scala name for such a function, and 'where' is for people who prefer SQL.
Method 1: Using filter(). The filter() method returns a DataFrame containing only the rows that satisfy the given condition; rows that do not match are dropped. It takes a condition and returns a new DataFrame, and the condition may refer to more than one column.
In Spark, the Filter function returns a new dataset formed by selecting those elements of the source on which the function returns true. So, it retrieves only the elements that satisfy the given condition.
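To make that description concrete, here is a minimal plain-Python sketch of the same semantics. It is not Spark code: the list `source` and the predicate `at_most_ten` are hypothetical names chosen for illustration, and the threshold of 10 simply mirrors the `somecol <= 10` condition from the question.

```python
# Plain-Python model of filter semantics; this is not Spark code.
source = [1, 5, 10, 12, 42]


def at_most_ten(value):
    """Predicate: returns True for the elements that should be kept."""
    return value <= 10


# Build a new list from the elements on which the predicate returns True.
# The source list itself is left unchanged.
kept = [value for value in source if at_most_ten(value)]
print(kept)
```

The point of the sketch is that filtering produces a new collection and never modifies the source, which matches the "returns a new dataset" wording above.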
Method 6: Using select() with collect(). This method is used to select a particular row from the DataFrame, and it can be used together with the collect() function. Here, dataframe is the PySpark DataFrame and columns is the list of columns to be displayed in each row.
According to the Spark documentation, "where() is an alias for filter()".
filter(condition)
Filters rows using the given condition. where() is an alias for filter().
Parameters: condition – a Column of types.BooleanType or a string of SQL expression.
>>> df.filter(df.age > 3).collect()
[Row(age=5, name=u'Bob')]
>>> df.where(df.age == 2).collect()
[Row(age=2, name=u'Alice')]
>>> df.filter("age > 3").collect()
[Row(age=5, name=u'Bob')]
>>> df.where("age = 2").collect()
[Row(age=2, name=u'Alice')]
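What "alias" means here can be sketched without a Spark installation. The class below is a hypothetical stand-in called `TinyFrame`, not Spark's DataFrame and not how Spark is implemented; it only assumes that an alias is a second name bound to the same function. The rows reuse the Alice and Bob values from the documentation example.

```python
class TinyFrame:
    """Hypothetical stand-in for a DataFrame: a list of dict rows."""

    def __init__(self, rows):
        self.rows = list(rows)

    def filter(self, condition):
        """Return a new TinyFrame holding the rows for which condition(row) is true."""
        return TinyFrame(row for row in self.rows if condition(row))

    # Alias: a second name bound to the very same function object.
    where = filter

    def collect(self):
        """Return the rows as a plain list."""
        return list(self.rows)


people = TinyFrame([{"age": 2, "name": "Alice"}, {"age": 5, "name": "Bob"}])

via_filter = people.filter(lambda row: row["age"] > 3).collect()
via_where = people.where(lambda row: row["age"] > 3).collect()

# Both names run the same code, so the results are identical.
print(via_filter == via_where)
```

Because both names point at one function, choosing between them is a matter of readability: use whichever reads more naturally next to the rest of the query.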