I'm trying to expand a DataFrame column with nested struct type (see below) to multiple columns. The Struct schema I'm working with looks something like {"foo": 3, "bar": {"baz": 2}}.
Ideally, I'd like to expand the above into two columns ("foo" and "bar.baz"). However, when I tried using .select("data.*") (where data is the Struct column), I only got columns foo and bar, where bar is still a struct.
Is there a way to expand the Struct for both layers?
In Spark SQL, SELECT * FROM a_join_b will flatten the structs and produce a table with fields named a_field1, a_field2, ..., b_field1, b_field2. Note the underscores between the table names and the field names, and that a and b can have similar field names.
In PySpark, the select() function is used to select a single column, multiple columns, columns by index, all columns, or nested columns from a DataFrame. select() is a transformation, so it returns a new DataFrame containing the selected columns.
You can select data.bar.baz as bar.baz:
df.show()
+-------+
| data|
+-------+
|[3,[2]]|
+-------+
df.printSchema()
root
|-- data: struct (nullable = false)
| |-- foo: long (nullable = true)
| |-- bar: struct (nullable = false)
| | |-- baz: long (nullable = true)
In pyspark:
import pyspark.sql.functions as F
df.select(F.col("data.foo").alias("foo"), F.col("data.bar.baz").alias("bar.baz")).show()
+---+-------+
|foo|bar.baz|
+---+-------+
| 3| 2|
+---+-------+
I ended up going for the following function that recursively "unwraps" nested Structs. Essentially, it keeps digging into Struct fields and leaves the other fields intact; this approach eliminates the need for a very long df.select(...) statement when the Struct has many fields. Here's the code:
from pyspark.sql import Row
from pyspark.sql.functions import col
from pyspark.sql.types import StructType

# Takes in a StructType schema object and returns a list of column
# selectors that flatten the Struct
def flatten_struct(schema, prefix=""):
    result = []
    for elem in schema:
        if isinstance(elem.dataType, StructType):
            result += flatten_struct(elem.dataType, prefix + elem.name + ".")
        else:
            result.append(col(prefix + elem.name).alias(prefix + elem.name))
    return result

df = sc.parallelize([Row(r=Row(a=1, b=Row(foo="b", bar="12")))]).toDF()
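The recursion itself doesn't depend on Spark at all; the same shape shows up when flattening a nested dict into dotted keys. A pure-Python analogue (illustration only, not part of the original answer):

```python
# Pure-Python analogue of the same recursion: walk a nested dict and
# collect leaf values under dotted key paths.
def flatten_dict(d, prefix=""):
    result = {}
    for key, value in d.items():
        if isinstance(value, dict):
            result.update(flatten_dict(value, prefix + key + "."))
        else:
            result[prefix + key] = value
    return result

flatten_dict({"foo": 3, "bar": {"baz": 2}})
# → {"foo": 3, "bar.baz": 2}
```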
df.show()
+----------+
| r|
+----------+
|[1,[12,b]]|
+----------+
df_expanded = df.select("r.*")
df_flattened = df_expanded.select(flatten_struct(df_expanded.schema))
df_flattened.show()
+---+-----+-----+
| a|b.bar|b.foo|
+---+-----+-----+
| 1| 12| b|
+---+-----+-----+