I'm working through a Databricks example. The schema for the dataframe looks like:
    > parquetDF.printSchema
    root
     |-- department: struct (nullable = true)
     |    |-- id: string (nullable = true)
     |    |-- name: string (nullable = true)
     |-- employees: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- firstName: string (nullable = true)
     |    |    |-- lastName: string (nullable = true)
     |    |    |-- email: string (nullable = true)
     |    |    |-- salary: integer (nullable = true)
In the example, they show how to explode the employees column into 4 additional columns:
    val explodeDF = parquetDF.explode($"employees") {
      case Row(employee: Seq[Row]) => employee.map{ employee =>
        val firstName = employee(0).asInstanceOf[String]
        val lastName = employee(1).asInstanceOf[String]
        val email = employee(2).asInstanceOf[String]
        val salary = employee(3).asInstanceOf[Int]
        Employee(firstName, lastName, email, salary)
      }
    }.cache()
    display(explodeDF)
How would I do something similar with the department column (i.e. add two additional columns to the dataframe called "id" and "name")? The methods aren't exactly the same, and I can only figure out how to create a brand new data frame using:
    val explodeDF = parquetDF.select("department.id", "department.name")
    display(explodeDF)
If I try:
    val explodeDF = parquetDF.explode($"department") {
      case Row(dept: Seq[String]) => dept.map{dept =>
        val id = dept(0)
        val name = dept(1)
      }
    }.cache()
    display(explodeDF)
I get the warning and error:
    <console>:38: warning: non-variable type argument String in type pattern Seq[String] is unchecked since it is eliminated by erasure
           case Row(dept: Seq[String]) => dept.map{dept =>
                          ^
    <console>:37: error: inferred type arguments [Unit] do not conform to method explode's type parameter bounds [A <: Product]
           val explodeDF = parquetDF.explode($"department") {
                                     ^
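Solution: Spark's explode function turns an array-of-struct column (ArrayType(StructType)) into rows, one row per array element. The department column is a single struct rather than an array, so it has nothing to explode; its fields only need to be selected out as top-level columns. The error in the question comes from the closure body ending in a val definition, which makes it return Unit, while explode requires a type A <: Product.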
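As a rough sketch of the array case, the employees column can be turned into one row per employee with `functions.explode` inside `withColumn`, followed by a star expansion of the resulting struct. This uses `functions.explode` in place of the `Dataset.explode` method from the question, which has been deprecated since Spark 2.0. The sample row, the local `SparkSession`, and the intermediate column name `employee` are assumptions made for illustration; in a Databricks notebook `spark` already exists and `parquetDF` would come from the real data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Assumption: a local session for a standalone run; a notebook already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("explode-sketch").getOrCreate()

// Made-up sample row with the same shape as the schema in the question.
val parquetDF = spark.sql("""
  SELECT
    named_struct('id', '123456', 'name', 'Engineering') AS department,
    array(
      named_struct('firstName', 'Ada', 'lastName', 'Lovelace',
                   'email', 'ada@example.com', 'salary', 100000),
      named_struct('firstName', 'Alan', 'lastName', 'Turing',
                   'email', 'alan@example.com', 'salary', 120000)
    ) AS employees
""")

// explode emits one row per array element; each element lands in a struct column "employee".
val perEmployee = parquetDF.withColumn("employee", explode(col("employees")))

// Star-expand the struct into firstName, lastName, email and salary columns.
val explodeDF = perEmployee.select("department", "employee.*")
explodeDF.show(false)
```

The department struct is carried along unchanged on every exploded row, so it can still be expanded afterwards.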
In my opinion, the most elegant solution is to star-expand the struct in a select, which adds its fields as top-level columns alongside the existing ones:
    val explodedDf2 = explodeDF.select("department.*", "*")
https://docs.databricks.com/spark/latest/spark-sql/complex-types.html
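For illustration, here is a minimal sketch of the star expansion applied to the department struct, plus a `withColumn` variant that names each field explicitly. The sample row and the local `SparkSession` are assumptions, as are the result names `withDeptCols`, `withoutStruct`, and `viaWithColumn`; none of them come from the original example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for a standalone run; a notebook already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("struct-expand-sketch").getOrCreate()

// Made-up sample row with the same shape as the schema in the question.
val parquetDF = spark.sql("""
  SELECT
    named_struct('id', '123456', 'name', 'Engineering') AS department,
    array(
      named_struct('firstName', 'Ada', 'lastName', 'Lovelace',
                   'email', 'ada@example.com', 'salary', 100000)
    ) AS employees
""")

// "department.*" expands the struct fields; "*" keeps every original column as well.
val withDeptCols = parquetDF.select("department.*", "*")

// Optionally drop the nested struct once its fields are top-level columns.
val withoutStruct = withDeptCols.drop("department")

// Alternative: append each field explicitly, which also allows renaming on the way.
val viaWithColumn = parquetDF
  .withColumn("id", col("department.id"))
  .withColumn("name", col("department.name"))

withDeptCols.printSchema()
```

The star form stays correct if fields are later added to the struct, while the `withColumn` form makes the resulting column names explicit; which one fits better depends on how stable the schema is.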