 

Filtering a Pyspark DataFrame with SQL-like IN clause

I want to filter a Pyspark DataFrame with a SQL-like IN clause, as in

sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.sql('SELECT * from my_df WHERE field1 IN a')

where a is the tuple (1, 2, 3). I am getting this error:

java.lang.RuntimeException: [1.67] failure: ``('' expected but identifier a found

which is basically saying it was expecting something like '(1, 2, 3)' instead of a. The problem is I can't manually write the values in a as it's extracted from another job.

How would I filter in this case?

asked Mar 08 '16 by mar tin

People also ask

How do you use like in DataFrame PySpark?

You can use the where and col functions together: where filters rows based on a condition (here, whether a column matches the pattern '%s%'), col('col_name') refers to the column, and like is the operator applied to it.
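A minimal sketch of that pattern, assuming a modern SparkSession entry point; the toy DataFrame and its columns k and v are made up for illustration and mirror the data used in the answer below:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "foo"), (2, "x"), (3, "bar")], ["k", "v"])

# like() uses SQL wildcard syntax: % matches any sequence of characters.
df.where(col("v").like("%o%")).show()  # keeps only the row (1, "foo")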

How do I check in conditions in PySpark?

Use the isin() function of the PySpark Column type to check whether the value of a DataFrame column is present in a list of values. Use the NOT operator (~) to negate the result of isin().
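A sketch of both forms, reusing the toy df built in the previous snippet:

from pyspark.sql.functions import col

# Rows whose v is in the list of values.
df.where(col("v").isin("foo", "bar")).show()   # (1, "foo") and (3, "bar")

# Negation with ~: rows whose v is NOT in the list.
df.where(~col("v").isin("foo", "bar")).show()  # (2, "x")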

How do I filter a Dataframe in pyspark with multiple conditions?

In PySpark, to filter() rows of a DataFrame on multiple conditions, you can use either Column expressions or a SQL expression string. A simple example uses the AND (&) condition; you can extend it with OR (|) and NOT (~) expressions as needed.
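For example, with the same toy df (each condition is wrapped in parentheses because & and | bind tighter than the comparison operators):

from pyspark.sql.functions import col

# AND: both conditions must hold.
df.filter((col("k") > 1) & (col("v") != "x")).show()     # (3, "bar")

# OR: either condition may hold.
df.filter((col("k") == 1) | (col("v") == "bar")).show()  # (1, "foo") and (3, "bar")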

How to filter similar values using wildcard characters in pyspark?

If you have a SQL background you will be familiar with like and rlike (regex like); PySpark provides the same methods on the Column class for filtering values with wildcard characters. You can also use rlike() to match values case-insensitively.
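A small sketch against the same toy df; the (?i) flag is the standard Java-regex way to make a match case-insensitive:

from pyspark.sql.functions import col

# rlike() takes a regular expression rather than SQL wildcards.
df.where(col("v").rlike("(?i)^FO")).show()  # matches "foo" despite the different case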

How to filter columns with multiple conditions in a Dataframe?

filter() is a function which filters rows based on a SQL expression or condition. To filter on multiple conditions, you can reference columns either with the col() function from pyspark.sql.functions or directly through the DataFrame object, e.g. dataframe["column_name"].
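A sketch equivalent to the col() example above, but referencing columns through the DataFrame object itself:

# df["k"] and df["v"] work the same way as col("k") and col("v").
df.filter((df["k"] > 1) & (df["v"] != "x")).show()  # (3, "bar")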

What is the parameter used by the like function in pyspark?

The parameter passed to the like function is the pattern to filter on. LIKE is a simple expression used to match character data in PySpark SQL or the DataFrame API. It supports two special wildcard characters that can be used to match elements: % (any sequence of characters) and _ (exactly one character).
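A short sketch of both wildcards against the toy df:

from pyspark.sql.functions import col

# % matches any run of characters, _ matches exactly one character.
df.where(col("v").like("f%")).show()   # (1, "foo")
df.where(col("v").like("ba_")).show()  # (3, "bar")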


1 Answer

The string you pass to SQLContext is evaluated in the scope of the SQL environment. It doesn't capture the closure. If you want to pass a variable you'll have to do it explicitly, using string formatting:

df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(("foo", "bar"))).count()
## 2

Obviously this is not something you would use in a "real" SQL environment due to security considerations but it shouldn't matter here.

In practice the DataFrame DSL is a much better choice when you want to create dynamic queries:

from pyspark.sql.functions import col

df.where(col("v").isin({"foo", "bar"})).count()
## 2

It is easy to build and compose, and it handles all the details of HiveQL / Spark SQL for you.
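For instance, a sketch (using the same df as the answer's code, nothing else assumed) of how the IN condition composes with another predicate:

from pyspark.sql.functions import col

# Conditions are ordinary Column expressions, so they compose with & and |.
in_list = col("v").isin({"foo", "bar"})
df.where(in_list & (col("k") > 1)).count()
## 1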

answered Oct 13 '22 by zero323