Exploding nested Struct in Spark dataframe

I'm working through a Databricks example. The schema for the dataframe looks like:

    > parquetDF.printSchema
    root
     |-- department: struct (nullable = true)
     |    |-- id: string (nullable = true)
     |    |-- name: string (nullable = true)
     |-- employees: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- firstName: string (nullable = true)
     |    |    |-- lastName: string (nullable = true)
     |    |    |-- email: string (nullable = true)
     |    |    |-- salary: integer (nullable = true)

In the example, they show how to explode the employees column into 4 additional columns:

    val explodeDF = parquetDF.explode($"employees") {
      case Row(employee: Seq[Row]) => employee.map{ employee =>
        val firstName = employee(0).asInstanceOf[String]
        val lastName = employee(1).asInstanceOf[String]
        val email = employee(2).asInstanceOf[String]
        val salary = employee(3).asInstanceOf[Int]
        Employee(firstName, lastName, email, salary)
      }
    }.cache()
    display(explodeDF)

How would I do something similar with the department column (i.e. add two additional columns to the dataframe called "id" and "name")? The methods aren't exactly the same, and I can only figure out how to create a brand new data frame using:

    val explodeDF = parquetDF.select("department.id","department.name")
    display(explodeDF)

If I try:

    val explodeDF = parquetDF.explode($"department") {
      case Row(dept: Seq[String]) => dept.map{dept =>
        val id = dept(0)
        val name = dept(1)
      }
    }.cache()
    display(explodeDF)

I get the warning and error:

    <console>:38: warning: non-variable type argument String in type pattern Seq[String] is unchecked since it is eliminated by erasure
                case Row(dept: Seq[String]) => dept.map{dept =>
                               ^
    <console>:37: error: inferred type arguments [Unit] do not conform to method explode's type parameter bounds [A <: Product]
      val explodeDF = parquetDF.explode($"department") {
                                ^

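In my opinion, the most elegant solution is to star-expand the struct with select. "department.*" promotes the struct's fields (id and name) to top-level columns, and "*" keeps every existing column alongside them. explode is not needed here: it turns the elements of an array or map into rows, whereas department is a single struct.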

    val explodedDf2 = explodedDf.select("department.*", "*")
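
To see the effect on a schema like the one in the question, here is a minimal sketch. The case classes, the single sample row, and names such as sampleDF and expandedDF are my own stand-ins rather than part of the Databricks example. It assumes Spark 2.x or later and is meant to be pasted into spark-shell or a notebook cell.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case classes mirroring the schema printed in the question.
case class Department(id: String, name: String)
case class Employee(firstName: String, lastName: String, email: String, salary: Int)
case class DepartmentWithEmployees(department: Department, employees: Seq[Employee])

// In spark-shell or a notebook, getOrCreate() returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("star-expand-sketch")
  .getOrCreate()
import spark.implicits._

// One made-up row standing in for parquetDF.
val sampleDF = Seq(
  DepartmentWithEmployees(
    Department("d-001", "Engineering"),
    Seq(Employee("Ada", "Lovelace", "ada@example.com", 100))
  )
).toDF()

// "department.*" lifts id and name to the top level; "*" keeps the original columns.
val expandedDF = sampleDF.select("department.*", "*")
expandedDF.printSchema()
// Columns are now: id, name, department, employees

// If the original struct is no longer wanted, it can be dropped afterwards.
val withoutStructDF = expandedDF.drop("department")
```

Putting "department.*" first places the new columns at the front; writing select("*", "department.*") instead would append them at the end.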

https://docs.databricks.com/spark/latest/spark-sql/complex-types.html
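
The employees column is a different case, because it is an array of structs rather than a single struct. As far as I know, the DataFrame.explode method used in the question has been deprecated since Spark 2.0 in favour of the explode function in org.apache.spark.sql.functions, and the same star expansion can follow it. A sketch under the same assumptions as before (hypothetical case classes, made-up rows; the "employee" alias is my own choice):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

// Hypothetical case classes mirroring the schema printed in the question.
case class Department(id: String, name: String)
case class Employee(firstName: String, lastName: String, email: String, salary: Int)
case class DepartmentWithEmployees(department: Department, employees: Seq[Employee])

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("explode-then-expand-sketch")
  .getOrCreate()
import spark.implicits._

// One made-up department with two employees.
val sampleDF = Seq(
  DepartmentWithEmployees(
    Department("d-001", "Engineering"),
    Seq(
      Employee("Ada", "Lovelace", "ada@example.com", 100),
      Employee("Alan", "Turing", "alan@example.com", 200)
    )
  )
).toDF()

// explode emits one row per array element; each element is a struct named "employee".
val perEmployeeDF = sampleDF.select($"department", explode($"employees").as("employee"))

// Star-expand both structs to get one flat row per employee.
val flatDF = perEmployeeDF.select("department.*", "employee.*")
flatDF.show()
```

This avoids the pattern match on Row and the asInstanceOf casts, since the field names and types come straight from the schema.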

Tags: scala, distributed-computing, apache-spark, apache-spark-sql, databricks

Feynman27

DHARIN PAREKH
