I am trying to run a subquery inside a CASE statement in PySpark and it throws an exception. I am trying to create a new flag if the id in one table is present in a different table.
Can anyone please let me know if this is even possible in PySpark?
temp_df=spark.sql("select *, case when key in (select distinct key from Ids) then 1 else 0 end as flag from main_table")
Here is the error:
AnalysisException: 'Predicate sub-queries can only be used in a Filter
As of Spark 2.0, Spark SQL supports subqueries. A subquery (a.k.a. subquery expression) is a query nested inside another query. There are several kinds of subqueries, among them a subquery as a source (inside a SQL FROM clause) and predicate subqueries (such as the IN expression above).
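For example, a subquery as a source is unrestricted. A quick sketch using the question's table names (assuming spark is the active SparkSession and Ids is a registered view):
# a FROM-clause subquery works anywhere a table could appear
spark.sql("select key from (select distinct key from Ids) t").show()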
PySpark's isin() (the IN operator in SQL) is used to check or filter whether DataFrame values exist in a given list of values. isin() is a method of the Column class which returns a boolean Column that is True where the value of the expression is contained in the evaluated values of the arguments.
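A minimal sketch of isin() as a filter predicate (the frame and values are made up for illustration; spark is an active SparkSession):
from pyspark.sql import functions as func
df = spark.createDataFrame([(1,), (2,), (5,)], ['key'])
df.filter(func.col('key').isin([1, 2, 3])).show()  # keeps key=1 and key=2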
The left anti join in PySpark uses the same join functionality, but it returns only the columns from the left DataFrame, and only for its non-matched records.
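A quick sketch with two throwaway frames:
# left anti join: keep left rows whose id has no match on the right
left = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], ['id', 'val'])
right = spark.createDataFrame([(1,), (2,)], ['id'])
left.join(right, on='id', how='left_anti').show()  # only id=3 remains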
This appears to be the most recent detailed documentation on subqueries; it covers Spark 2.0, and I haven't seen a major update in this area since then.
The linked notebook in that reference makes it clear that predicate subqueries are indeed currently supported only within WHERE clauses. That is, this would work (but of course would not yield the desired result):
spark.sql("select * from main_table where id in (select distinct id from ids_table)")
You could get the same result by using a left JOIN; that's what IN subqueries are generally translated into (for more details refer to the aforementioned linked notebook).
For example:
# set up some data
from pyspark.sql import SparkSession, functions as func

spark = SparkSession.builder.getOrCreate()
l1 = [('Alice', 1), ('Bob', 2), ('Eve', 3)]
df1 = spark.createDataFrame(l1, ['name', 'id'])
l2 = [(1,), (2,)]
df2 = spark.createDataFrame(l2, ['id'])
df1.createOrReplaceTempView("main_table")
df2.createOrReplaceTempView("ids_table")
# use a left join: a non-matched row gets NULL on the right side, which maps to flag=0
spark.sql("select * from main_table m left join ids_table d on (m.id=d.id)") \
    .withColumn('flag', func.when(func.col('d.id').isNull(), 0).otherwise(1)) \
    .drop('id').collect()
# result:
[Row(name='Bob', flag=1), Row(name='Eve', flag=0), Row(name='Alice', flag=1)]
Or, using PySpark SQL functions rather than SQL syntax:
# rename to avoid an ambiguous 'id' column after the join
df2 = df2.withColumnRenamed('id', 'id_faux')
df1.join(df2, df1.id == df2.id_faux, how='left') \
    .withColumn('flag', func.when(func.col('id_faux').isNull(), 0).otherwise(1)) \
    .drop('id_faux').collect()