Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to flatten a struct in a Spark dataframe?

I have a dataframe with the following structure:

 |-- data: struct (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- keyNote: struct (nullable = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- note: string (nullable = true)
 |    |-- details: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)

How it is possible to flatten the structure and create a new dataframe:

     |-- id: long (nullable = true)
     |-- keyNote: struct (nullable = true)
     |    |-- key: string (nullable = true)
     |    |-- note: string (nullable = true)
     |-- details: map (nullable = true)
     |    |-- key: string
     |    |-- value: string (valueContainsNull = true)

Is there something like explode, but for structs?

like image 860
djWann Avatar asked Sep 30 '22 14:09

djWann


People also ask

How do you explode a struct in spark?

Problem: How to explode Array of StructType DataFrame columns to rows using Spark. Solution: Spark explode function can be used to explode an Array of Struct ArrayType(StructType) columns to rows on Spark DataFrame using scala example. Before we start, let's create a DataFrame with Struct column in an array.

What is flatten in spark?

In the Spark SQL, flatten function is a built-in function that is defined as a function to convert an Array of the Array column (nested array) that is ArrayanyType(ArrayanyType(StringType)) into the single array column on the Spark DataFrame. The Spark SQL is defined as the Spark module for structured data processing.

How do you convert struct to string in PySpark?

In order to convert array to a string, PySpark SQL provides a built-in function concat_ws() which takes delimiter of your choice as a first argument and array column (type Column) as the second argument. In order to use concat_ws() function, you need to import it using pyspark.


1 Answers

This should work in Spark 1.6 or later:

df.select(df.col("data.*"))

or

df.select(df.col("data.id"), df.col("data.keyNote"), df.col("data.details"))
like image 119
2 revsuser6022341 Avatar answered Oct 19 '22 01:10

2 revsuser6022341