How to cast string to ArrayType of dictionary (JSON) in PySpark

I am trying to cast a StringType column to an ArrayType of JSON for a dataframe generated from a CSV file.

I am using PySpark on Spark 2.

The CSV file I am dealing with is as follows:


As shown above, it contains one attribute, "attribute3", stored as a literal string. It is technically a list of dictionaries (JSON) with an exact length of 2. (This is the output of the distinct function.)

Snippet from printSchema():

attribute3: string (nullable = true)

I am trying to cast "attribute3" to ArrayType as follows:

temp = dataframe.withColumn(
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __init__() takes at least 2 arguments (1 given)

Indeed, ArrayType expects a data type as an argument. I tried "json", but it did not work.

Desired output: in the end, I need to convert attribute3 to ArrayType() or a plain Python list. (I am trying to avoid using eval.)

How do I convert it to ArrayType, so that I can treat it as a list of JSON objects?

Am I missing anything here?

(The documentation does not address this problem in a straightforward way.)

Use from_json with a schema that matches the actual data in the attribute3 column to convert the JSON string to ArrayType:

Original data frame:

# |-- date: string (nullable = true)
# |-- attribute2: string (nullable = true)
# |-- count: long (nullable = true)
# |-- attribute3: string (nullable = true)

from pyspark.sql.functions import from_json
from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                               StructField, StructType)

Create the schema:

schema = ArrayType(
    StructType([StructField("key", StringType()), 
                StructField("key2", IntegerType())]))

Use from_json:

df = df.withColumn("attribute3", from_json(df.attribute3, schema))

# |-- date: string (nullable = true)
# |-- attribute2: string (nullable = true)
# |-- count: long (nullable = true)
# |-- attribute3: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- key: string (nullable = true)
# |    |    |-- key2: integer (nullable = true)

df.show(1, False)
#|date      |attribute2|count|attribute3                          |
#|2017-09-03|attribute1|2    |[[value, 2], [value, 2], [value, 2]]|
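
If a plain Python list is enough (the other desired output in the question), the string can be parsed outside Spark with the standard-library json module instead of eval. The following is only a minimal sketch: the sample string and the helper name parse_attribute3 are hypothetical, and it assumes the column holds valid JSON with double-quoted keys, like the data matched by the schema above.

```python
import json

# Hypothetical sample value; the real contents of attribute3 are assumed
# to be a JSON array of objects with the fields "key" and "key2".
raw = '[{"key": "value", "key2": 2}, {"key": "value", "key2": 2}]'


def parse_attribute3(text):
    """Parse a JSON array string into a list of dicts.

    Returns None for missing input, malformed JSON, or JSON that is
    not an array, instead of raising.
    """
    if text is None:
        return None
    try:
        parsed = json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    return parsed if isinstance(parsed, list) else None


items = parse_attribute3(raw)
print(items)
```

json.loads only accepts real JSON. If the strings were written with single quotes (a Python repr rather than JSON), ast.literal_eval from the standard library is a safer substitute for eval.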
