Remove nested column in PySpark

I have a PySpark DataFrame with a struct column results. I want to remove the nested field "Attributes" from inside the results column.

The schema of the dataframe (there are more columns, but I have not shown them for convenience because the schema is large):

 |-- results: struct (nullable = true)
 |    |-- l: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- m: struct (nullable = true)
 |    |    |    |    |-- Attributes: struct (nullable = true)
 |    |    |    |    |    |-- m: struct (nullable = true)
 |    |    |    |    |    |    |-- Score: struct (nullable = true)
 |    |    |    |    |    |    |    |-- n: string (nullable = true)
 |    |    |    |    |-- OtherInfo: struct (nullable = true)
 |    |    |    |    |    |-- l: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- m: struct (nullable = true)
 |    |    |    |    |    |    |    |    |-- Name: string (nullable = true)

How can I do this without a UDF in PySpark?

One row of data:

{
    "results": {
        "l": [
            {
                "m": {
                    "Attributes": {
                        "m": {
                            "Score": {"n": "85"}
                        }
                    },
                    "OtherInfo": {
                        "l": [
                            {
                                "m": {
                                    "Name": "john"
                                }
                            },
                            {
                                "m": {
                                    "Name": "Cena"
                                }
                            }
                        ]
                    }
                }
            }
        ]
    }
}
asked Jun 26 '26 02:06 by mightyMouse


2 Answers

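Since Spark 3.1, dropFields can remove a nested field without rebuilding the rest of a large struct by hand.

from pyspark.sql import functions as F

df = df.withColumn(
    "results",
    F.struct(
        F.transform(
            F.col("results.l"),
            lambda x: x.m.dropFields("Attributes"),
        ).alias("l")  # name the array field "l" inside the new struct
    ),
)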

Result:

df.printSchema()
# root
#  |-- results: struct (nullable = false)
#  |    |-- l: array (nullable = false)
#  |    |    |-- element: struct (containsNull = false)
#  |    |    |    |-- OtherInfo: struct (nullable = false)
#  |    |    |    |    |-- l: array (nullable = false)
#  |    |    |    |    |    |-- element: struct (containsNull = false)
#  |    |    |    |    |    |    |-- m: struct (nullable = false)
#  |    |    |    |    |    |    |    |-- Name: string (nullable = true)
answered Jun 27 '26 16:06 by ZygD


To delete a field from a struct, you have to create a new struct containing every field of the original except the one you want to delete.

Here, since the field l under results is an array, you can use the transform function (Spark 2.4+) to update all of its struct elements like this:

from pyspark.sql.functions import struct, expr


t_expr = "transform(results.l, x -> struct(struct(x.m.OtherInfo as OtherInfo) as m))"
df = df.withColumn("results", struct(expr(t_expr).alias("l")))

For each element x in the array, we create a new struct that holds only the x.m.OtherInfo field.

df.printSchema()

#root
# |-- results: struct (nullable = false)
# |    |-- l: array (nullable = true)
# |    |    |-- element: struct (containsNull = false)
# |    |    |    |-- m: struct (nullable = false)
# |    |    |    |    |-- OtherInfo: struct (nullable = true)
# |    |    |    |    |    |-- l: array (nullable = true)
# |    |    |    |    |    |    |-- element: struct (containsNull = true)
# |    |    |    |    |    |    |    |-- m: struct (nullable = true)
# |    |    |    |    |    |    |    |    |-- Name: string (nullable = true)
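When m has more than a couple of fields, writing the inner struct(...) by hand gets tedious. As a rough sketch, the expression string can be generated from a list of field names. keep_fields_expr is a hypothetical helper, not part of PySpark. It assumes that the array elements contain only the one wrapper field and that no field name needs backtick quoting.

```python
def keep_fields_expr(array_col, inner, fields, drop):
    """Build a Spark SQL transform() expression that rebuilds the `inner`
    struct of every element of `array_col` without the field `drop`."""
    kept = [f for f in fields if f != drop]
    if not kept:
        raise ValueError("dropping the only field would leave an empty struct")
    inner_struct = ", ".join(f"x.{inner}.{f} as {f}" for f in kept)
    return f"transform({array_col}, x -> struct(struct({inner_struct}) as {inner}))"


# Field names of results.l[].m, listed by hand here (an assumption);
# with a real DataFrame they could be read from df.schema instead.
m_fields = ["Attributes", "OtherInfo"]
t_expr = keep_fields_expr("results.l", "m", m_fields, drop="Attributes")
print(t_expr)
# transform(results.l, x -> struct(struct(x.m.OtherInfo as OtherInfo) as m))
```

The generated string is the same as the hand-written t_expr above, so it can be passed to expr() in exactly the same way.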
answered Jun 27 '26 17:06 by blackbishop


