Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

convert array to struct pyspark

I'm quite new to pyspark and I have a dataframe that currently looks like below.

| col1                            | col2              |
+---------------------------------+-------------------+
| [(a, 0)], [(b,0)].....[(z,1)]   | [0, 0, ... 1]     |
| [(b, 0)], [(b,1)].....[(z,0)]   | [0, 1, ... 0]     |
| [(a, 0)], [(c, 1)].....[(z,0)]  | [0, 1, ... 0]     |

I extracted values from col1.QueryNum into col2 and when I print the schema, it's an array containing the list of number from col1.QueryNum.

Ultimately my goal is to convert the list values in col2 into struct format inside pyspark(refer to desired schema).

Current Schema

 |-- col1: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- types: string (nullable = true)
 |    |    |-- QueryNum: integer (nullable = true)
 |-- col2: array (nullable = true)
 |    |-- element: integer (containsNull = true)

Desired Schema

 |-- col2: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- val1: integer (nullable = true)
 |    |    |-- val2: integer (nullable = true)
                 .
                 .
                 .
 |    |    |-- val80: integer (nullable = true)

I tried using from_json and it's not really working.

like image 741
jynn Avatar asked Aug 31 '25 01:08

jynn


1 Answers

If the you have fixed array size you can create struct using list-comprehension:

from pyspark.sql import functions as F

df1 = df.withColumn(
    "col2",
    F.array(
        F.struct(*[
            F.col("col1")[i]["QueryNum"].alias(f"val{i+1}") for i in range(2)
        ])
    )
)

df1.show()
#+----------------+--------+
#|col1            |col2    |
#+----------------+--------+
#|[[0, a], [0, b]]|[[0, 0]]|
#|[[0, b], [1, b]]|[[0, 1]]|
#|[[0, a], [1, c]]|[[0, 1]]|
#+----------------+--------+

df1.printSchema()
#root
#|-- col1: array (nullable = true)
#|    |-- element: struct (containsNull = true)
#|    |    |-- QueryNum: long (nullable = true)
#|    |    |-- types: string (nullable = true)
#|-- col2: array (nullable = false)
#|    |-- element: struct (containsNull = false)
#|    |    |-- val1: long (nullable = true)
#|    |    |-- val2: long (nullable = true)

Note however that there is no need to use array in this case as you'll always have one struct in that array. Just use simple struct:

df1 = df.withColumn(
    "col2",
    F.struct(*[
        F.col("col1")[i]["QueryNum"].alias(f"val{i+1}") for i in range(2)
    ])
)

Or if you prefer a map type:

df1 = df.withColumn(
    "col2",
    F.map_from_entries(
        F.expr("transform(col1, (x,i) -> struct('val' || (i+1) as name, x.QueryNum as value))")
    )
)
like image 162
blackbishop Avatar answered Sep 02 '25 14:09

blackbishop