I have a PySpark dataframe with a column results. Inside the results column I want to remove the column "Attributes".
The schema of the dataframe (there are more columns, but I have not shown them for convenience because the schema is large):
|-- results: struct (nullable = true)
| |-- l: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- m: struct (nullable = true)
| | | | |-- Attributes: struct (nullable = true)
| | | | | |-- m: struct (nullable = true)
| | | | | | |-- Score: struct (nullable = true)
| | | | | | | |-- n: string (nullable = true)
| | | | |-- OtherInfo: struct (nullable = true)
| | | | | |-- l: array (nullable = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- m: struct (nullable = true)
| | | | | | | | |-- Name: string (nullable = true)
How to do this without a udf in PySpark?
One row of data:
{
"results" : {
"l" : [
{
"m":{
"Attributes" : {
"m" : {
"Score" : {"n" : "85"}
}
},
"OtherInfo":{
"l" : [
{
"m" : {
"Name" : {"john"}
}
},
{
"m" : {
"Name" : "Cena"}
}
]
}
}
}
]
}
}
Since Spark 3.1, dropFields can be used without recreating the rest of a large schema.
df = df.withColumn(
"results",
F.struct(F.transform(
F.col("results.l"),
lambda x: x.m.dropFields("Attributes")
)).alias("l")
)
Result:
df.printSchema()
# root
# |-- results: struct (nullable = false)
# | |-- l: array (nullable = false)
# | | |-- element: struct (containsNull = false)
# | | | |-- OtherInfo: struct (nullable = false)
# | | | | |-- l: array (nullable = false)
# | | | | | |-- element: struct (containsNull = false)
# | | | | | | |-- m: struct (nullable = false)
# | | | | | | | |-- Name: string (nullable = true)
To delete a field from a struct type you have to create a new struct with all the elements but the one you want to delete from the original struct.
Here, as the field l under results is an array, you could use transform function (Spark 2.4+) to update all its struct elements like this:
from pyspark.sql.functions import struct, expr
t_expr = "transform(results.l, x -> struct(struct(x.m.OtherInfo as OtherInfo) as m))"
df = df.withColumn("results", struct(expr(t_expr).alias("l")))
For each element x in the array, we create new struct that holds only x.m.OtherInfo field.
df.printSchema()
#root
# |-- results: struct (nullable = false)
# | |-- l: array (nullable = true)
# | | |-- element: struct (containsNull = false)
# | | | |-- m: struct (nullable = false)
# | | | | |-- OtherInfo: struct (nullable = true)
# | | | | | |-- l: array (nullable = true)
# | | | | | | |-- element: struct (containsNull = true)
# | | | | | | | |-- m: struct (nullable = true)
# | | | | | | | | |-- Name: string (nullable = true)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With