The statement below generates "pos" and "col" as the default column names when I use the posexplode() function in Spark SQL:
scala> spark.sql(""" with t1(select to_date('2019-01-01') first_day) select first_day,date_sub(add_months(first_day,1),1) last_day, posexplode(array(5,6,7)) from t1 """).show(false)
+----------+----------+---+---+
|first_day |last_day  |pos|col|
+----------+----------+---+---+
|2019-01-01|2019-01-31|0  |5  |
|2019-01-01|2019-01-31|1  |6  |
|2019-01-01|2019-01-31|2  |7  |
+----------+----------+---+---+
What is the syntax to override those default names in spark.sql?
With DataFrames, this can be done by giving the generator a Seq of aliases, e.g. posexplode(lit(arr)).as(Seq("arr_pos","arr_val")):
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val arr = Array(5,6,7)
arr: Array[Int] = Array(5, 6, 7)
scala> Seq("dummy").toDF("x").select(posexplode(lit(arr)).as(Seq("arr_pos","arr_val"))).show(false)
+-------+-------+
|arr_pos|arr_val|
+-------+-------+
|0      |5      |
|1      |6      |
|2      |7      |
+-------+-------+
How do I get the same in SQL? I tried
spark.sql(""" with t1(select to_date('2011-01-01') first_day) select first_day,date_sub(add_months(first_day,1),1) last_day, posexplode(array(5,6,7)) as(Seq('p','c')) from t1 """).show(false)
and
spark.sql(""" with t1(select to_date('2011-01-01') first_day) select first_day,date_sub(add_months(first_day,1),1) last_day, posexplode(array(5,6,7)) as(('p','c')) from t1 """).show(false)
but both of them throw errors.
The columns produced by posexplode of an array are named pos and col by default, but they can be aliased with a tuple such as AS (myPos, myValue). For a map, the columns are named pos, key, and value by default, and can likewise be aliased, e.g. AS (myPos, myKey, myValue).
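For example, a minimal sketch of the map case (the literal map here is made up for illustration):
spark.sql("SELECT posexplode(map('a', 5, 'b', 6)) AS (p, k, v)").show(false)
This should return one row per map entry, with its position, key, and value under the chosen names.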
You can either use LATERAL VIEW:
spark.sql("""
WITH t1 AS (SELECT to_date('2011-01-01') first_day)
SELECT first_day, date_sub(add_months(first_day,1),1) last_day, p, c
FROM t1
LATERAL VIEW posexplode(array(5,6,7)) AS p, c
""").show
+----------+----------+---+---+
| first_day|  last_day|  p|  c|
+----------+----------+---+---+
|2011-01-01|2011-01-31|  0|  5|
|2011-01-01|2011-01-31|  1|  6|
|2011-01-01|2011-01-31|  2|  7|
+----------+----------+---+---+
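Hive's LATERAL VIEW syntax additionally expects a table alias before AS; Spark accepts that form as well, so the following sketch (the alias name tmp is arbitrary) should behave identically:
spark.sql("""
WITH t1 AS (SELECT to_date('2011-01-01') first_day)
SELECT first_day, date_sub(add_months(first_day,1),1) last_day, p, c
FROM t1
LATERAL VIEW posexplode(array(5,6,7)) tmp AS p, c
""").show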
or a tuple of aliases:
spark.sql("""
WITH t1 AS (SELECT to_date('2011-01-01') first_day)
SELECT first_day, date_sub(add_months(first_day,1),1) last_day,
posexplode(array(5,6,7)) AS (p, c)
FROM t1
""").show
+----------+----------+---+---+
| first_day|  last_day|  p|  c|
+----------+----------+---+---+
|2011-01-01|2011-01-31|  0|  5|
|2011-01-01|2011-01-31|  1|  6|
|2011-01-01|2011-01-31|  2|  7|
+----------+----------+---+---+
Tested with Spark 2.4.0.
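The tuple-of-aliases form should also carry over to the DataFrame API through selectExpr, which parses SQL expressions; a minimal sketch:
Seq("dummy").toDF("x").selectExpr("posexplode(array(5,6,7)) AS (arr_pos, arr_val)").show(false)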
Please note that aliases are not strings, and shouldn't be quoted with ' or ". If you have to use non-standard identifiers, you should use backticks, e.g.
WITH t1 AS (SELECT to_date('2011-01-01') first_day)
SELECT first_day, date_sub(add_months(first_day,1),1) last_day,
posexplode(array(5,6,7)) AS (`arr pos`, `arr_value`)
FROM t1
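Any later reference to such an identifier needs the same backticks. A sketch that builds on the query above by reading the generated columns through a second CTE:
spark.sql("""
WITH t1 AS (SELECT to_date('2011-01-01') first_day),
     t2 AS (SELECT posexplode(array(5,6,7)) AS (`arr pos`, `arr_value`) FROM t1)
SELECT `arr pos`, arr_value FROM t2
""").show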