
How to give alias name for posexplode columns in Spark SQL?

The statement below produces the default column names "pos" and "col" when I use the posexplode() function in Spark SQL:

scala> spark.sql(""" with t1(select to_date('2019-01-01') first_day) select first_day,date_sub(add_months(first_day,1),1) last_day, posexplode(array(5,6,7)) from t1 """).show(false)
+----------+----------+---+---+
|first_day |last_day  |pos|col|
+----------+----------+---+---+
|2019-01-01|2019-01-31|0  |5  |
|2019-01-01|2019-01-31|1  |6  |
|2019-01-01|2019-01-31|2  |7  |
+----------+----------+---+---+

What is the syntax to override those default names in spark.sql? With the DataFrame API, this can be done by passing a sequence of names to .as() on the posexplode column:

scala> val arr= Array(5,6,7)
arr: Array[Int] = Array(5, 6, 7)

scala> Seq(("dummy")).toDF("x").select(posexplode(lit(arr)).as(Seq("arr_pos","arr_val"))).show(false)
+-------+-------+
|arr_pos|arr_val|
+-------+-------+
|0      |5      |
|1      |6      |
|2      |7      |
+-------+-------+

How can I get the same result in SQL? I tried

spark.sql(""" with t1(select to_date('2011-01-01') first_day) select first_day,date_sub(add_months(first_day,1),1) last_day, posexplode(array(5,6,7)) as(Seq('p','c')) from t1 """).show(false)

and

spark.sql(""" with t1(select to_date('2011-01-01') first_day) select first_day,date_sub(add_months(first_day,1),1) last_day, posexplode(array(5,6,7)) as(('p','c')) from t1 """).show(false)

but both throw an error.

asked Jan 22 '19 13:01 by stack0114106




1 Answer

You can either use LATERAL VIEW:

spark.sql("""
  WITH t1 AS (SELECT to_date('2011-01-01') first_day)
  SELECT first_day, date_sub(add_months(first_day,1),1) last_day, p, c
  FROM t1
  LATERAL VIEW  posexplode(array(5,6,7)) AS p, c
""").show
+----------+----------+---+---+
| first_day|  last_day|  p|  c|
+----------+----------+---+---+
|2011-01-01|2011-01-31|  0|  5|
|2011-01-01|2011-01-31|  1|  6|
|2011-01-01|2011-01-31|  2|  7|
+----------+----------+---+---+

or a tuple of aliases:

spark.sql("""
  WITH t1 AS (SELECT to_date('2011-01-01') first_day)
  SELECT first_day, date_sub(add_months(first_day,1),1) last_day,
         posexplode(array(5,6,7)) AS (p, c) 
  FROM t1 
""").show
+----------+----------+---+---+
| first_day|  last_day|  p|  c|
+----------+----------+---+---+
|2011-01-01|2011-01-31|  0|  5|
|2011-01-01|2011-01-31|  1|  6|
|2011-01-01|2011-01-31|  2|  7|
+----------+----------+---+---+

Tested with Spark 2.4.0.
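The alias tuple is part of Spark's expression syntax, so it should also work inside selectExpr, and a map input takes three aliases (position, key, value) instead of two. The following is a minimal sketch, assuming Spark SQL is on the classpath; the object name PosexplodeAliasSketch and its two methods are made up for illustration and are not part of Spark.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, for illustration only.
object PosexplodeAliasSketch {

  // Array input: posexplode yields (position, value), so two aliases are given.
  def arrayExample(spark: SparkSession): DataFrame =
    spark.range(1).selectExpr("posexplode(array(5, 6, 7)) AS (p, c)")

  // Map input: posexplode yields (position, key, value), so three aliases are given.
  def mapExample(spark: SparkSession): DataFrame =
    spark.sql("SELECT posexplode(map('a', 1, 'b', 2)) AS (p, k, v)")
}
```

In spark-shell, where `spark` already exists, PosexplodeAliasSketch.arrayExample(spark).show(false) and PosexplodeAliasSketch.mapExample(spark).show(false) display the renamed columns.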

Please note that aliases are identifiers, not strings, so they shouldn't be quoted with ' or ". If you have to use non-standard identifiers (for example, names containing spaces), use backticks:

WITH t1 AS (SELECT to_date('2011-01-01') first_day)
SELECT first_day, date_sub(add_months(first_day,1),1) last_day,
       posexplode(array(5,6,7)) AS (`arr pos`, `arr_value`) 
FROM t1 
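
When the alias names come from variables rather than being typed into the query, the backtick rule above can be applied programmatically. The following is a small sketch in plain Scala; AliasSql, quoteIdent and aliasTuple are hypothetical names, not Spark APIs, and the sketch assumes that a backtick inside a quoted identifier is escaped by doubling it.

```scala
// Hypothetical helper, for illustration only (not part of Spark).
object AliasSql {

  // Wrap a name in backticks; an embedded backtick is doubled
  // (assumed escape rule for backtick-quoted identifiers).
  def quoteIdent(name: String): String =
    "`" + name.replace("`", "``") + "`"

  // Build an alias tuple such as: AS (`arr pos`, `arr_value`)
  def aliasTuple(names: Seq[String]): String =
    names.map(quoteIdent).mkString("AS (", ", ", ")")
}
```

For example, s"SELECT posexplode(array(5,6,7)) ${AliasSql.aliasTuple(Seq("arr pos", "arr_value"))}" builds the same select expression as the query above, which can then be passed to spark.sql.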

answered Oct 17 '22 02:10 by user10938362