Flatten Nested Spark Dataframe

Question

Is there a way to flatten an arbitrarily nested Spark DataFrame? Most of the examples I'm seeing are written for a specific schema, and I'd like to be able to generically flatten a DataFrame with different nested types (e.g. StructType, ArrayType, MapType, etc).

Say I have a schema like:

StructType(List(StructField(field1,...), StructField(field2,...), StructField(nested_array,ArrayType(StructType(List(StructField(nested_field1,...), StructField(nested_field2,...))),...),...)))

I'm looking to turn this into a flat table with columns like:

field1
field2
nested_array.nested_field1
nested_array.nested_field2

FYI, I'm looking for suggestions for PySpark, but other flavors of Spark are also appreciated.

MaFF · Accepted Answer

This question might be a bit old, but for anyone out there still looking for a solution: you can flatten struct columns inline by selecting "column_name.*", which expands every field of the struct into its own top-level column.

First, let's create the nested DataFrame:

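from pyspark.sql import SparkSession

# SparkSession is the entry point since Spark 2.0; HiveContext is deprecated.
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
nested_df = spark.read.json(sc.parallelize(["""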
{
  "field1": 1, 
  "field2": 2, 
  "nested_array":{
     "nested_field1": 3,
     "nested_field2": 4
  }
}
"""]))

Now, to flatten it:

flat_df = nested_df.select("field1", "field2", "nested_array.*")

You'll find useful examples here: https://docs.databricks.com/delta/data-transformation/complex-types.html

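If you have too many struct columns to list by hand, you can build the selection from the DataFrame's dtypes. Note that this expands only one level of nesting and only struct columns; arrays and maps are passed through unchanged: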

flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
flat_df = nested_df.select(*flat_cols, *[c + ".*" for c in nested_cols])
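For structs nested more than one level deep, one option is to walk the schema recursively and collect the dotted path of every leaf field. The following is only a minimal sketch, not a built-in Spark feature: leaf_paths and flatten_structs are hypothetical helper names, and the sketch assumes field names contain no dots or other characters that would need backtick quoting. It works on the plain dict returned by df.schema.jsonValue(), so the path logic can be tried without a Spark session. Array and map columns are kept as single leaves, since expanding them (for example with explode) changes the number of rows.

```python
def leaf_paths(struct_json, prefix=""):
    """Return the dotted path of every non-struct leaf under a struct schema dict.

    `struct_json` has the shape produced by `df.schema.jsonValue()`:
    {"type": "struct", "fields": [{"name": ..., "type": ...}, ...]}.
    """
    paths = []
    for field in struct_json["fields"]:
        path = prefix + field["name"]
        ftype = field["type"]
        # Nested structs are dicts with "type" == "struct"; simple types are
        # plain strings, and arrays/maps are dicts with another "type" value.
        if isinstance(ftype, dict) and ftype.get("type") == "struct":
            paths.extend(leaf_paths(ftype, path + "."))
        else:
            paths.append(path)
    return paths


def flatten_structs(df):
    """Hypothetical helper: select every leaf of `df`, joining path parts with "_"."""
    # Imported lazily so that leaf_paths can be used without PySpark installed.
    from pyspark.sql import functions as F

    paths = leaf_paths(df.schema.jsonValue())
    return df.select([F.col(p).alias(p.replace(".", "_")) for p in paths])


# Hand-written schema dict mirroring the example DataFrame above.
example_schema = {
    "type": "struct",
    "fields": [
        {"name": "field1", "type": "long", "nullable": True, "metadata": {}},
        {"name": "field2", "type": "long", "nullable": True, "metadata": {}},
        {
            "name": "nested_array",
            "type": {
                "type": "struct",
                "fields": [
                    {"name": "nested_field1", "type": "long", "nullable": True, "metadata": {}},
                    {"name": "nested_field2", "type": "long", "nullable": True, "metadata": {}},
                ],
            },
            "nullable": True,
            "metadata": {},
        },
    ],
}

paths = leaf_paths(example_schema)
print(paths)
# ['field1', 'field2', 'nested_array.nested_field1', 'nested_array.nested_field2']
```

With a live session, flatten_structs(nested_df) would then select those paths and name the resulting columns field1, field2, nested_array_nested_field1 and nested_array_nested_field2. The underscore separator is a choice made here so the flattened names do not need backticks later; swap it for whatever convention you prefer.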
