
Automatically and Elegantly flatten DataFrame in Spark SQL

All,

Is there an elegant, accepted way to flatten a Spark SQL table (Parquet) with columns that are of nested StructType?

For example

If my schema is:

foo
 |_bar
 |_baz
x
y
z

How do I select it into a flattened tabular form without resorting to manually running

df.select("foo.bar","foo.baz","x","y","z") 

In other words, how do I obtain the result of the above code programmatically, given just a StructType and a DataFrame?

asked May 26 '16 by echen


2 Answers

The short answer is, there's no "accepted" way to do this, but you can do it very elegantly with a recursive function that generates your select(...) statement by walking through the DataFrame.schema.

The recursive function should return an Array[Column]. Every time the function hits a StructType, it would call itself and append the returned Array[Column] to its own Array[Column].

Something like:

import org.apache.spark.sql.Column
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions.col

def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    val colName = if (prefix == null) f.name else (prefix + "." + f.name)

    f.dataType match {
      case st: StructType => flattenSchema(st, colName)
      case _ => Array(col(colName))
    }
  })
}

You would then use it like this:

df.select(flattenSchema(df.schema):_*) 
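One follow-up worth knowing about: selecting col("foo.bar") yields a column named just bar, so if two different structs share a leaf field name, the flattened DataFrame ends up with duplicate column names. A minimal variant (a sketch, not part of the original answer; the underscore separator is an arbitrary choice, and the imports are the same as above) aliases each leaf with its full path:

def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    val colName = if (prefix == null) f.name else (prefix + "." + f.name)
    f.dataType match {
      case st: StructType => flattenSchema(st, colName)
      // alias each leaf so that e.g. "foo.bar" comes out as "foo_bar"
      case _ => Array(col(colName).as(colName.replace(".", "_")))
    }
  })
}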
answered Oct 21 '22 by David Griffin


Just wanted to share my solution for PySpark: it's more or less a translation of @David Griffin's solution, so it supports any level of nested objects.

from pyspark.sql.types import StructType, ArrayType

def flatten(schema, prefix=None):
    fields = []
    for field in schema.fields:
        name = prefix + '.' + field.name if prefix else field.name
        dtype = field.dataType
        if isinstance(dtype, ArrayType):
            dtype = dtype.elementType
        if isinstance(dtype, StructType):
            fields += flatten(dtype, prefix=name)
        else:
            fields.append(name)
    return fields

df.select(flatten(df.schema)).show()
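One caveat (an observation, not part of the original answer): because this version descends into ArrayType elements, a path that passes through an array, such as someArray.field, selects a single array column holding that field's value for every element, rather than one row per element. If you want one row per element instead, explode the array column first.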
answered Oct 21 '22 by Evan V