I want to filter a PySpark DataFrame with a SQL-like IN clause, as in

sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.sql('SELECT * from my_df WHERE field1 IN a')

where a is the tuple (1, 2, 3). I am getting this error:
java.lang.RuntimeException: [1.67] failure: ``('' expected but identifier a found
which is basically saying it was expecting something like '(1, 2, 3)' instead of a. The problem is that I can't write the values in a manually, since a is extracted from another job.
How would I filter in this case?
You can use the where and col functions to do this. where filters rows based on a condition (here, whether a column matches a pattern such as '%s%'), col('col_name') refers to the column, and like is the operator, as in the sketch below.
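A minimal sketch, assuming a toy DataFrame with a string column v (the same shape as the one used in the answer further down):

from pyspark.sql.functions import col

df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))
df.where(col("v").like("%oo%")).count()  # rows whose v contains "oo"
## 1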
You can also use the isin() function of the PySpark Column type to check whether a DataFrame column's value is present in a list of values, and the NOT operator (~) to negate the result of isin(), as sketched below.
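A sketch of both directions, reusing the df defined above:

from pyspark.sql.functions import col

df.where(col("v").isin(["foo", "bar"])).count()   # v IS in the list
## 2
df.where(~col("v").isin(["foo", "bar"])).count()  # v is NOT in the list
## 1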
To filter() rows of a DataFrame on multiple conditions, you can use either a Column with a condition or a SQL expression. Below is a simple example using an AND (&) condition; you can extend it with OR (|) and NOT (~) expressions as needed.
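A sketch with the same df; each condition is wrapped in parentheses because & and | bind more tightly than the comparison operators in Python:

from pyspark.sql.functions import col

df.filter((col("k") > 1) & (col("v") != "x")).count()        # AND
## 1
df.filter((col("v") == "foo") | (col("v") == "bar")).count() # OR
## 2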
If you have a SQL background, you are probably familiar with like and rlike (regex like); PySpark provides similar methods on the Column class for filtering values with wildcard characters. You can also use rlike() to match values case-insensitively, as sketched below.
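A sketch: like() takes SQL wildcards, while rlike() takes a Java regex, so case-insensitive matching can be done with the (?i) inline flag:

from pyspark.sql.functions import col

df.where(col("v").like("ba%")).count()         # v starts with "ba"
## 1
df.where(col("v").rlike("(?i)^FOO$")).count()  # matches "foo" regardless of case
## 1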
filter() filters rows based on a SQL expression or a Column condition. To refer to a column you can use the col() function from pyspark.sql.functions, or attribute access such as df.column_name; both forms are shown in the sketch below.
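A sketch of the two equivalent forms, again with the same df:

from pyspark.sql.functions import col

df.filter("v = 'foo' AND k = 1").count()                  # SQL expression string
## 1
df.filter((col("v") == "foo") & (col("k") == 1)).count()  # Column condition
## 1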
The parameter passed to like() is the pattern to filter on. LIKE is a simple expression for matching strings in Spark SQL or the DataFrame API, and it supports two special characters: % matches any sequence of characters, and _ matches exactly one character.
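A sketch of both wildcards against the same df:

from pyspark.sql.functions import col

df.where(col("v").like("f__")).count()  # _ matches one character: "foo"
## 1
df.where(col("v").like("%a%")).count()  # % matches any sequence: "bar"
## 1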
The string you pass to SQLContext is evaluated in the scope of the SQL environment. It doesn't capture the closure. If you want to pass a variable you'll have to do it explicitly, using string formatting:
df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(("foo", "bar"))).count()
## 2
Obviously this is not something you would use in a "real" SQL environment due to security considerations, but it shouldn't matter here.
In practice, the DataFrame DSL is a much better choice when you want to create dynamic queries:
from pyspark.sql.functions import col

df.where(col("v").isin({"foo", "bar"})).count()
## 2
It is easy to build and compose, and it handles all the details of HiveQL / Spark SQL for you.
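Applied back to the original question, a sketch: assuming a is the tuple produced by the other job, you can pass it straight to isin() (reusing the toy df and its k column here in place of my_df.field1):

from pyspark.sql.functions import col

a = (1, 2, 3)  # in practice this comes from the other job
df.where(col("k").isin(list(a))).count()
## 3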