 

Assign a variable a dynamic value in SQL in Databricks / Spark

I feel like I must be missing something obvious here, but I can't seem to dynamically set a variable value in Spark SQL.

Let's say I have two tables, tableSrc and tableBuilder, and I'm creating tableDest.

I've been trying variants on

SET myVar FLOAT = NULL

SELECT
    myVar = avg(myCol)
FROM tableSrc;

CREATE TABLE tableDest(
    refKey INT,
    derivedValue FLOAT
);


INSERT INTO tableDest
    SELECT
        refKey,
        neededValue * myVar AS `derivedValue`
    FROM tableBuilder

Doing this in T-SQL is trivial, in a surprising win for Microsoft (DECLARE...SELECT). Spark, however, throws

Error in SQL statement: ParseException: mismatched input 'SELECT' expecting <EOF>(line 53, pos 0)

Whatever variant I try, I can't assign a derived value to a variable for reuse. The closest I got was assigning a variable to a string containing a select statement.

[Databricks screenshot of the error]

Please note that this is being adapted from a fully functional script in T-SQL, and so I'd just as soon not split out the dozen or so SQL variables to compute all those variables with Python spark queries just to insert {var1}, {var2}, etc in a multi hundred line f-string. I know how to do this, but it will be messy, difficult, harder to read, slower to migrate, and worse to maintain and would like to avoid this if at all possible.
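For readers unfamiliar with that workaround, here is a minimal sketch of the f-string approach the question wants to avoid. The table and column names come from the question; the variable's value is hard-coded here, since in a real notebook it would come from a spark.sql query:

```python
# Sketch of the Python f-string workaround the question wants to avoid.
# In a real Databricks notebook, my_var would come from a Spark query, e.g.:
#   my_var = spark.sql("SELECT avg(myCol) FROM tableSrc").first()[0]
# It is hard-coded here to keep the sketch self-contained.
my_var = 2.5  # placeholder for the computed average

insert_sql = f"""
INSERT INTO tableDest
SELECT
    refKey,
    neededValue * {my_var} AS derivedValue
FROM tableBuilder
"""
# In a notebook you would then run: spark.sql(insert_sql)
print(insert_sql)
```

Multiply this by a dozen variables and a multi-hundred-line statement, and the maintenance concern above becomes clear.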

asked Dec 11 '19 by Philip Kahn


People also ask

How do I declare a variable in spark SQL?

The short answer is no: Spark SQL did not support variables at the time of writing. SQL Server uses T-SQL, which extends the SQL standard with procedural programming, local variables, and other features. Spark SQL is pure SQL, only partially compatible with the SQL standard.
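One caveat worth knowing: since this was written, newer Databricks runtimes (DBR 14.1 and later, and Spark 4.0) have added SQL session variables, which, as I understand the feature, cover exactly the original question:

```sql
-- Session variables on newer Databricks runtimes (DBR 14.1+);
-- table and column names are taken from the question.
DECLARE VARIABLE myVar FLOAT DEFAULT NULL;
SET VAR myVar = (SELECT avg(myCol) FROM tableSrc);

SELECT
    refKey,
    neededValue * myVar AS derivedValue
FROM tableBuilder;
```

On older runtimes, the workarounds below still apply.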

How do you pass parameters to Databricks notebook?

If you are running a notebook from another notebook, use dbutils.notebook.run("/path/to/notebook", timeout_seconds=120, arguments={}); you can pass variables through the arguments dictionary.


2 Answers

The SET command you used is for spark.conf get/set, not for declaring a variable usable in SQL queries.

For SQL queries you should use widgets:

https://docs.databricks.com/notebooks/widgets.html

But there is a way of using spark.conf parameters in SQL:

%python
spark.conf.set('personal.foo', 'bar')

Then you can use:

%sql
SELECT * FROM table WHERE column = '${personal.foo}';

The tricky part is that you have to use a dot (or another special character) in the name of the spark.conf key, or SQL cells will prompt you to provide a value for the $variable at run time. (It looks like a bug to me; I believe wrapping the name in {} should be enough.)
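To make the substitution mechanics concrete, here is a rough Python simulation of what the SQL cell does with ${...} placeholders. This is an illustration of the behavior, not Databricks' actual implementation:

```python
import re

# Stand-in for values set earlier with spark.conf.set(...)
conf = {"personal.foo": "bar"}

def substitute(sql: str, conf: dict) -> str:
    """Replace ${key} placeholders with values from the conf store."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: conf[m.group(1)], sql)

query = "SELECT * FROM table WHERE column = '${personal.foo}'"
print(substitute(query, conf))
# -> SELECT * FROM table WHERE column = 'bar'
```

The dot in 'personal.foo' is what keeps the SQL cell from treating $personal as a bare parameter and prompting for it at run time.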

answered Oct 23 '22 by Ronieri Marques


Databricks just released SQL user-defined functions, which can handle a similar problem with no performance penalty; for your example it would look like:

CREATE TEMPORARY FUNCTION myVar()
RETURNS FLOAT
LANGUAGE SQL
RETURN (
    SELECT avg(myCol)
    FROM tableSrc
);

And then for use:

SELECT
      refKey,
      neededValue * myVar() AS `derivedValue`
FROM tableBuilder
answered Oct 23 '22 by matkurek