I am using Spark SQL (I mention that it is in Spark in case that affects the SQL syntax - I'm not familiar enough to be sure yet) and I have a table that I am trying to re-structure, but I'm getting stuck trying to transpose multiple columns at the same time.
Basically I have data that looks like:
    userId    someString    varA        varB
    1         "example1"    [0,2,5]     [1,2,9]
    2         "example2"    [1,20,5]    [9,null,6]
and I'd like to explode both varA and varB simultaneously (the length will always be consistent) - so that the final output looks like this:
    userId    someString    varA    varB
    1         "example1"    0       1
    1         "example1"    2       2
    1         "example1"    5       9
    2         "example2"    1       9
    2         "example2"    20      null
    2         "example2"    5       6
but I can only seem to get a single explode(var) statement to work in one command, and if I try to chain them (i.e. create a temp table after the first explode command) then I get a huge number of duplicate, unnecessary rows - effectively the Cartesian product of the two arrays.
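For reference, a minimal sketch of the chained approach I mean (assuming the table is loaded as a DataFrame named df) - this is the version that multiplies rows:

    import org.apache.spark.sql.functions.explode

    // Each explode pass expands rows independently, so two 3-element
    // arrays produce 9 rows per user instead of the 3 I want.
    df.withColumn("varA", explode($"varA"))
      .withColumn("varB", explode($"varB"))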
Many thanks!
The Spark SQL explode function is used to split an array or map column of a DataFrame into rows. Spark defines several flavors of this function: explode_outer, which also handles nulls and empty arrays; posexplode, which explodes together with the position of each element; and posexplode_outer, which combines the two.
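As a quick illustration (a minimal sketch, assuming a DataFrame df with the array columns from the question):

    import org.apache.spark.sql.functions.{explode_outer, posexplode}

    // posexplode emits a `pos` column with each element's index
    // alongside the exploded `col` value
    df.select($"userId", posexplode($"varA")).show

    // explode_outer keeps rows whose array is null or empty
    // (the exploded value becomes null instead of dropping the row)
    df.select($"userId", explode_outer($"varB")).show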
You can select one or more columns of a Spark DataFrame by passing the column names to the select() function. Since a DataFrame is immutable, this creates a new DataFrame with the selected columns. The show() function prints the DataFrame contents.
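For example (hypothetical usage, same df as above):

    // Returns a new two-column DataFrame; the original df is unchanged
    df.select("userId", "someString").show()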
Spark SQL also provides a pivot() function to rotate the values of one column into multiple columns (transposing rows to columns). It is an aggregation in which the distinct values of one of the grouping columns become individual columns.
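A loose sketch of pivot() for orientation (hypothetical - it is not what the answer below uses, but shows the row-to-column direction):

    import org.apache.spark.sql.functions.first

    // Distinct values of someString ("example1", "example2")
    // become columns, with first(varA) as the aggregated cell value
    df.groupBy($"userId").pivot("someString").agg(first($"varA"))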
Spark >= 2.4
You can skip the zip udf and use the arrays_zip function:
df.withColumn("vars", explode(arrays_zip($"varA", $"varB"))).select( $"userId", $"someString", $"vars.varA", $"vars.varB").show
Spark < 2.4
What you want is not possible without a custom UDF. In Scala you could do something like this:
    val data = sc.parallelize(Seq(
      """{"userId": 1, "someString": "example1", "varA": [0, 2, 5], "varB": [1, 2, 9]}""",
      """{"userId": 2, "someString": "example2", "varA": [1, 20, 5], "varB": [9, null, 6]}"""
    ))

    val df = spark.read.json(data)

    df.printSchema
    // root
    //  |-- someString: string (nullable = true)
    //  |-- userId: long (nullable = true)
    //  |-- varA: array (nullable = true)
    //  |    |-- element: long (containsNull = true)
    //  |-- varB: array (nullable = true)
    //  |    |-- element: long (containsNull = true)
Now we can define the zip udf:
    import org.apache.spark.sql.functions.{udf, explode}

    val zip = udf((xs: Seq[Long], ys: Seq[Long]) => xs.zip(ys))

    df.withColumn("vars", explode(zip($"varA", $"varB")))
      .select($"userId", $"someString", $"vars._1".alias("varA"), $"vars._2".alias("varB"))
      .show

    // +------+----------+----+----+
    // |userId|someString|varA|varB|
    // +------+----------+----+----+
    // |     1|  example1|   0|   1|
    // |     1|  example1|   2|   2|
    // |     1|  example1|   5|   9|
    // |     2|  example2|   1|   9|
    // |     2|  example2|  20|null|
    // |     2|  example2|   5|   6|
    // +------+----------+----+----+
With raw SQL:
    sqlContext.udf.register("zip", (xs: Seq[Long], ys: Seq[Long]) => xs.zip(ys))
    df.registerTempTable("df")

    sqlContext.sql(
      """SELECT userId, someString, explode(zip(varA, varB)) AS vars FROM df""")
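Note that this raw SQL version leaves vars as a single struct column. A sketch of flattening it in the same style (the subquery alias tmp is mine):

    sqlContext.sql(
      """SELECT userId, someString, vars._1 AS varA, vars._2 AS varB
        |FROM (SELECT userId, someString, explode(zip(varA, varB)) AS vars
        |      FROM df) tmp""".stripMargin)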