I want to filter a PySpark DataFrame with a SQL-like IN clause, as in

sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.sql('SELECT * from my_df WHERE field1 IN a')

where a is the tuple (1, 2, 3). I am getting this error:
java.lang.RuntimeException: [1.67] failure: ``('' expected but identifier a found
which is basically saying it was expecting something like '(1, 2, 3)' instead of a. The problem is that I can't write the values in a manually, since a is extracted from another job.
How would I filter in this case?
You can use the where and col functions to do this. where filters rows based on a condition (here, whether a column matches a pattern such as '%s%'), col('col_name') refers to the column, and like is the operator, as in the sketch below.
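A minimal sketch, assuming a toy DataFrame with a string column v (the same shape as the one used in the answer further down):

from pyspark.sql.functions import col

df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))
df.where(col("v").like("%oo%")).count()  # rows whose v contains "oo"
## 1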
You can also use the isin() function of the PySpark Column type to check whether a DataFrame column's value is present in a list of values, and the NOT operator (~) to negate the result of isin(), as sketched below.
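A sketch of both directions, reusing the df defined above:

from pyspark.sql.functions import col

df.where(col("v").isin(["foo", "bar"])).count()   # v IS in the list
## 2
df.where(~col("v").isin(["foo", "bar"])).count()  # v is NOT in the list
## 1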
To filter() rows of a DataFrame on multiple conditions, you can use either a Column with a condition or a SQL expression. Below is a simple example using an AND (&) condition; you can extend it with OR (|) and NOT (~) expressions as needed.
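A sketch with the same df; each condition is wrapped in parentheses because & and | bind more tightly than the comparison operators in Python:

from pyspark.sql.functions import col

df.filter((col("k") > 1) & (col("v") != "x")).count()        # AND
## 1
df.filter((col("v") == "foo") | (col("v") == "bar")).count() # OR
## 2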
If you have a SQL background, you are probably familiar with like and rlike (regex like); PySpark provides similar methods on the Column class for filtering values with wildcard characters. You can also use rlike() to match values case-insensitively, as sketched below.
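A sketch: like() takes SQL wildcards, while rlike() takes a Java regex, so case-insensitive matching can be done with the (?i) inline flag:

from pyspark.sql.functions import col

df.where(col("v").like("ba%")).count()         # v starts with "ba"
## 1
df.where(col("v").rlike("(?i)^FOO$")).count()  # matches "foo" regardless of case
## 1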
filter() filters rows based on a SQL expression or a Column condition. To refer to a column you can use the col() function from pyspark.sql.functions, or attribute access such as df.column_name; both forms are shown in the sketch below.
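A sketch of the two equivalent forms, again with the same df:

from pyspark.sql.functions import col

df.filter("v = 'foo' AND k = 1").count()                  # SQL expression string
## 1
df.filter((col("v") == "foo") & (col("k") == 1)).count()  # Column condition
## 1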
The parameter passed to like() is the pattern to filter on. LIKE is a simple expression for matching strings in Spark SQL or the DataFrame API, and it supports two special characters: % matches any sequence of characters, and _ matches exactly one character.
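A sketch of both wildcards against the same df:

from pyspark.sql.functions import col

df.where(col("v").like("f__")).count()  # _ matches one character: "foo"
## 1
df.where(col("v").like("%a%")).count()  # % matches any sequence: "bar"
## 1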
The string you pass to SQLContext is evaluated in the scope of the SQL environment. It doesn't capture the closure. If you want to pass a variable you'll have to do it explicitly, using string formatting:
df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(("foo", "bar"))).count()
## 2
Obviously this is not something you would use in a "real" SQL environment due to security considerations, but it shouldn't matter here.
In practice, the DataFrame DSL is a much better choice when you want to create dynamic queries:
from pyspark.sql.functions import col

df.where(col("v").isin({"foo", "bar"})).count()
## 2
It is easy to build and compose, and it handles all the details of HiveQL / Spark SQL for you.
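Applied back to the original question, a sketch: assuming a is the tuple produced by the other job, you can pass it straight to isin() (reusing the toy df and its k column here in place of my_df.field1):

from pyspark.sql.functions import col

a = (1, 2, 3)  # in practice this comes from the other job
df.where(col("k").isin(list(a))).count()
## 3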