Spark SQL filter multiple fields

What is the correct syntax for filtering on multiple columns in the Scala API? I want to do something like this:

dataFrame.filter($"col01" === "something" && $"col02" === "something else")

or

dataFrame.filter($"col01" === "something" || $"col02" === "something else") 

EDIT:

This is what my original code looks like. Everything comes in as a string.

df.select($"userID" as "user", $"itemID" as "item", $"quantity" cast("int"), $"price" cast("float"), $"discount" cast ("float"), sqlf.substring($"datetime", 0, 10) as "date", $"group")
  .filter($"item" !== "" && $"group" !== "-1")
asked Apr 27 '16 by gstvolvr

People also ask

How do I filter two columns in PySpark?

Use the filter() method, which returns a new DataFrame containing only the rows that satisfy the given condition. To filter on two (or more) columns, combine the per-column conditions into a single expression, as in the sketch below.
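A minimal PySpark sketch; the DataFrame, column names, and values here are invented for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("filter-example").getOrCreate()

# Hypothetical data: userID, itemID, quantity
orders = spark.createDataFrame(
    [("u1", "i1", 2), ("u2", "", 1), ("u3", "i3", 5)],
    ["userID", "itemID", "quantity"],
)

# Each condition is parenthesized because & binds tighter than != and >
orders.filter((col("itemID") != "") & (col("quantity") > 1)).show()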

How do you use multiple filters in PySpark?

In PySpark, to filter() rows of a DataFrame on multiple conditions, you can use either Column expressions or a SQL expression string. The sketch below shows an AND (&) condition; you can extend it with OR (|) and NOT (~) as needed.
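A short sketch, assuming a DataFrame df with the col01/col02 columns from the question above:

from pyspark.sql.functions import col

# AND: both conditions must hold
df.filter((col("col01") == "something") & (col("col02") == "something else"))

# OR: either condition may hold
df.filter((col("col01") == "something") | (col("col02") == "something else"))

# NOT: negate a condition with ~
df.filter(~(col("col01") == "something"))

# The same AND filter expressed as a SQL string
df.filter("col01 = 'something' AND col02 = 'something else'")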

How do I filter specific columns in PySpark DataFrame?

Select Single & Multiple Columns From PySpark You can select the single or multiple columns of the DataFrame by passing the column names you wanted to select to the select() function. Since DataFrame is immutable, this creates a new DataFrame with selected columns.

How do you subset in PySpark?

To subset a DataFrame, use the filter() function (or its alias where()). It keeps only the rows that satisfy the given condition, which may be a single comparison or several combined ones; see the sketch below, where df is the DataFrame being subset.
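A sketch under the assumption that df carries the quantity and price columns from the question's schema:

# filter() and where() are interchangeable aliases; both return a new DataFrame
subset = df.filter(df.quantity > 1)                          # single condition
subset = df.where((df.quantity > 1) & (df.price < 100.0))    # multiple conditions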


1 Answer

I think I see what the issue is. For some reason, Spark does not allow two !== comparisons in the same filter expression. (Most likely this is Scala operator precedence: an operator that ends in = but does not start with =, such as !==, is parsed as an assignment operator with the lowest precedence, so && binds more tightly than !== and the combined expression no longer groups as intended.) One would need to look at how filter is defined in the Spark source code to confirm.

To make your code work, you can write the filter like this:

df.filter(col("item").notEqual("") && col("group").notEqual("-1"))

or chain two filters in the same statement:

df.filter($"item" !== "").filter($"group" !== "-1").select(....)


answered Oct 13 '22 by dheee