Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Exploding nested Struct in Spark dataframe

I'm working through a Databricks example. The schema for the dataframe looks like:

> parquetDF.printSchema root |-- department: struct (nullable = true) |    |-- id: string (nullable = true) |    |-- name: string (nullable = true) |-- employees: array (nullable = true) |    |-- element: struct (containsNull = true) |    |    |-- firstName: string (nullable = true) |    |    |-- lastName: string (nullable = true) |    |    |-- email: string (nullable = true) |    |    |-- salary: integer (nullable = true) 

In the example, they show how to explode the employees column into 4 additional columns:

val explodeDF = parquetDF.explode($"employees") {  case Row(employee: Seq[Row]) => employee.map{ employee =>   val firstName = employee(0).asInstanceOf[String]   val lastName = employee(1).asInstanceOf[String]   val email = employee(2).asInstanceOf[String]   val salary = employee(3).asInstanceOf[Int]   Employee(firstName, lastName, email, salary)  } }.cache() display(explodeDF) 

How would I do something similar with the department column (i.e. add two additional columns to the dataframe called "id" and "name")? The methods aren't exactly the same, and I can only figure out how to create a brand new data frame using:

val explodeDF = parquetDF.select("department.id","department.name") display(explodeDF) 

If I try:

val explodeDF = parquetDF.explode($"department") {    case Row(dept: Seq[String]) => dept.map{dept =>    val id = dept(0)    val name = dept(1)   }  }.cache() display(explodeDF) 

I get the warning and error:

<console>:38: warning: non-variable type argument String in type pattern Seq[String] is unchecked since it is eliminated by erasure             case Row(dept: Seq[String]) => dept.map{dept =>                             ^ <console>:37: error: inferred type arguments [Unit] do not conform to    method explode's type parameter bounds [A <: Product]   val explodeDF = parquetDF.explode($"department") {                                     ^ 
like image 911
Feynman27 Avatar asked Sep 01 '16 15:09

Feynman27


People also ask

How do you explode an array of struct in Spark?

Solution: Spark explode function can be used to explode an Array of Struct ArrayType(StructType) columns to rows on Spark DataFrame using scala example. Before we start, let's create a DataFrame with Struct column in an array.


1 Answers

In my opinion the most elegant solution is to star expand a Struct using a select operator as shown below:

var explodedDf2 = explodedDf.select("department.*","*") 

https://docs.databricks.com/spark/latest/spark-sql/complex-types.html

like image 159
DHARIN PAREKH Avatar answered Sep 24 '22 10:09

DHARIN PAREKH