Given a schema like:
root
|-- first_name: string
|-- last_name: string
|-- degrees: array
| |-- element: struct
| | |-- school: string
| | |-- advisors: struct
| | | |-- advisor1: string
| | | |-- advisor2: string
How can I get a schema like:
root
|-- first_name: string
|-- last_name: string
|-- degrees: array
| |-- element: struct
| | |-- school: string
| | |-- advisor1: string
| | |-- advisor2: string
Currently, I explode the array, flatten the struct by selecting advisors.*, group by first_name and last_name, and rebuild the array with collect_list. I'm hoping there's a cleaner/shorter way to do this; the current approach involves a lot of painful field renaming that I don't want to get into here. Thanks!
You can use a udf to change the datatype of nested columns in a dataframe. Suppose you have read the dataframe as df1:
from pyspark.sql.functions import udf
from pyspark.sql.types import *
def foo(data):
    # Flatten each degree: pull advisor1/advisor2 up out of the nested struct
    return [
        (x["school"], x["advisors"]["advisor1"], x["advisors"]["advisor2"])
        for x in data
    ]
degree_schema = ArrayType(
    StructType([
        StructField("school", StringType()),
        StructField("advisor1", StringType()),
        StructField("advisor2", StringType())
    ])
)
udf_foo = udf(foo, degree_schema)
df2 = df1.withColumn("degrees", udf_foo("degrees"))
df2.printSchema()
Output:
root
|-- degrees: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- school: string (nullable = true)
| | |-- advisor1: string (nullable = true)
| | |-- advisor2: string (nullable = true)
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
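Since foo is plain Python, you can sanity-check the mapping without Spark (the sample values below are made up):

```python
def foo(data):
    # Same mapping the UDF applies to each row's degrees array
    return [
        (x["school"], x["advisors"]["advisor1"], x["advisors"]["advisor2"])
        for x in data
    ]

degrees = [
    {"school": "MIT", "advisors": {"advisor1": "Alice", "advisor2": "Bob"}},
]
print(foo(degrees))  # → [('MIT', 'Alice', 'Bob')]
```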