Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark - SELECT WHERE or filtering?

What's the difference between selecting with a where clause and filtering in Spark?
Are there any use cases in which one is more appropriate than the other one?

When do I use

DataFrame newdf = df.select(df.col("*")).where(df.col("somecol").leq(10)) 

and when is

DataFrame newdf = df.select(df.col("*")).filter("somecol <= 10") 

more appropriate?

like image 574
lte__ Avatar asked Aug 10 '16 08:08

lte__


People also ask

What is the difference between where and filter in Spark?

Both 'filter' and 'where' in Spark SQL gives same result. There is no difference between the two. It's just filter is simply the standard Scala name for such a function, and where is for people who prefer SQL.

How do I filter multiple columns in Spark data frame?

Method 1: Using filter() Method filter() is used to return the dataframe based on the given condition by removing the rows in the dataframe or by extracting the particular rows or columns from the dataframe. We are going to filter the dataframe on multiple columns. It can take a condition and returns the dataframe.

What is the function of filter () in Spark?

In Spark, the Filter function returns a new dataset formed by selecting those elements of the source on which the function returns true. So, it retrieves only the elements that satisfy the given condition.

How do I select a specific row in Spark DataFrame?

Method 6: Using select() with collect() method This method is used to select a particular row from the dataframe, It can be used with collect() function. where, dataframe is the pyspark dataframe. Columns is the list of columns to be displayed in each row.


1 Answers

According to spark documentation "where() is an alias for filter()"

filter(condition) Filters rows using the given condition. where() is an alias for filter().

Parameters: condition – a Column of types.BooleanType or a string of SQL expression.

>>> df.filter(df.age > 3).collect() [Row(age=5, name=u'Bob')] >>> df.where(df.age == 2).collect() [Row(age=2, name=u'Alice')]  >>> df.filter("age > 3").collect() [Row(age=5, name=u'Bob')] >>> df.where("age = 2").collect() [Row(age=2, name=u'Alice')] 
like image 170
Yaron Avatar answered Sep 20 '22 02:09

Yaron