 

Assign a variable a dynamic value in SQL in Databricks / Spark

I feel like I must be missing something obvious here, but I can't seem to dynamically set a variable value in Spark SQL.

Let's say I have two tables, tableSrc and tableBuilder, and I'm creating tableDest.

I've been trying variants on

SET myVar FLOAT = NULL

SELECT
    myVar = avg(myCol)
FROM tableSrc;

CREATE TABLE tableDest(
    refKey INT,
    derivedValue FLOAT
);


INSERT INTO tableDest
    SELECT
        refKey,
        neededValue * myVar AS `derivedValue`
    FROM tableBuilder

Doing this in T-SQL is trivial, in a surprising win for Microsoft (DECLARE...SELECT). Spark, however, throws

Error in SQL statement: ParseException: mismatched input 'SELECT' expecting <EOF>(line 53, pos 0)

Whatever variant I try, I can't assign a derived value to a variable for reuse. The closest I got was assigning a variable to a string containing a select statement.

[Databricks screenshot of the error]

Please note that this is being adapted from a fully functional script in T-SQL, and so I'd just as soon not split out the dozen or so SQL variables to compute all those variables with Python spark queries just to insert {var1}, {var2}, etc in a multi hundred line f-string. I know how to do this, but it will be messy, difficult, harder to read, slower to migrate, and worse to maintain and would like to avoid this if at all possible.
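For readers unfamiliar with that workaround, here is a minimal sketch of the f-string approach the question wants to avoid. The table and column names come from the question; the variable's value is hard-coded here, since in a real notebook it would come from a spark.sql query:

```python
# Sketch of the Python f-string workaround the question wants to avoid.
# In a real Databricks notebook, my_var would come from a Spark query, e.g.:
#   my_var = spark.sql("SELECT avg(myCol) FROM tableSrc").first()[0]
# It is hard-coded here to keep the sketch self-contained.
my_var = 2.5  # placeholder for the computed average

insert_sql = f"""
INSERT INTO tableDest
SELECT
    refKey,
    neededValue * {my_var} AS derivedValue
FROM tableBuilder
"""
# In a notebook you would then run: spark.sql(insert_sql)
print(insert_sql)
```

Multiply this by a dozen variables and a multi-hundred-line statement, and the maintenance concern above becomes clear.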

asked Dec 11 '19 by Philip Kahn


People also ask

How do I declare a variable in spark SQL?

The short answer is no: Spark SQL did not support variables at the time of writing. SQL Server uses T-SQL, which extends the SQL standard with procedural programming, local variables, and other features. Spark SQL is pure SQL, only partially compatible with the SQL standard.
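One caveat worth knowing: since this was written, newer Databricks runtimes (DBR 14.1 and later, and Spark 4.0) have added SQL session variables, which, as I understand the feature, cover exactly the original question:

```sql
-- Session variables on newer Databricks runtimes (DBR 14.1+);
-- table and column names are taken from the question.
DECLARE VARIABLE myVar FLOAT DEFAULT NULL;
SET VAR myVar = (SELECT avg(myCol) FROM tableSrc);

SELECT
    refKey,
    neededValue * myVar AS derivedValue
FROM tableBuilder;
```

On older runtimes, the workarounds below still apply.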

How do you pass parameters to Databricks notebook?

If you are running a notebook from another notebook, use dbutils.notebook.run("/path/to/notebook", timeout_seconds=120, arguments={}); you can pass variables through the arguments dictionary.


2 Answers

The SET command you used is for spark.conf get/set, not for declaring a variable usable in SQL queries.

For SQL queries you should use widgets:

https://docs.databricks.com/notebooks/widgets.html

But there is a way of using spark.conf parameters in SQL:

%python
spark.conf.set('personal.foo', 'bar')

Then you can use:

%sql
SELECT * FROM table WHERE column = '${personal.foo}';

The tricky part is that you have to use a dot (or another special character) in the name of the spark.conf key, or SQL cells will prompt you to provide a value for the $variable at run time. (It looks like a bug to me; I believe wrapping the name in {} should be enough.)
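To make the substitution mechanics concrete, here is a rough Python simulation of what the SQL cell does with ${...} placeholders. This is an illustration of the behavior, not Databricks' actual implementation:

```python
import re

# Stand-in for values set earlier with spark.conf.set(...)
conf = {"personal.foo": "bar"}

def substitute(sql: str, conf: dict) -> str:
    """Replace ${key} placeholders with values from the conf store."""
    return re.sub(r"\$\{([^}]+)\}", lambda m: conf[m.group(1)], sql)

query = "SELECT * FROM table WHERE column = '${personal.foo}'"
print(substitute(query, conf))
# -> SELECT * FROM table WHERE column = 'bar'
```

The dot in 'personal.foo' is what keeps the SQL cell from treating $personal as a bare parameter and prompting for it at run time.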

answered Oct 23 '22 by Ronieri Marques


Databricks just released SQL user-defined functions, which can handle a similar problem with no performance penalty; for your example it would look like:

CREATE TEMPORARY FUNCTION myVar()
RETURNS FLOAT
LANGUAGE SQL
RETURN (
    SELECT avg(myCol)
    FROM tableSrc
);

And then for use:

SELECT
      refKey,
      neededValue * myVar() AS `derivedValue`
FROM tableBuilder
answered Oct 23 '22 by matkurek