I am trying to do the following:
    from pyspark.sql import Row
    from pyspark.sql import types as T

    # Array of structs, used as the type of field B
    subschema = T.ArrayType(T.StructType([
        T.StructField("AA", T.LongType(), True),
        T.StructField("BB", T.StringType(), True),
    ]), True)
    s = T.StructType([
        T.StructField("B", subschema, True),
        T.StructField("A", T.StringType(), True),
    ])
    d = [Row(
        B=None,
        A="AAA",
    )]
    # spark is an existing SparkSession
    df = spark.createDataFrame(d, schema=s)
But I am getting an error that does not make sense to me:
    ValueError: Unexpected tuple 'A' with StructType
If I comment out either field A or field B, the error disappears, but I don't understand why this is happening. What is the problem? Is this a bug, or is there something wrong in my code?
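This is due to the alphabetical ordering of the fields when you create a Row using keyword arguments. In Spark < 3, Row(B=None, A="AAA") is stored as Row(A="AAA", B=None), and the values are then matched to your schema by position, so it tries to apply the type of B to the value of A. The string "AAA" is most likely being treated as the array, with each character 'A' converted as a struct, which is why the error mentions 'A' and StructType.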
In Spark 3.0, the alphabetical sorting of Row fields was removed (for Python 3.6 and later), and I was able to run your code without any error.
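The reordering can be illustrated without Spark. This is only a rough simulation, assuming that Spark < 3 sorts the keyword arguments by name and then pairs the values with the schema fields by position. `legacy_row` and `pair_by_position` are hypothetical helpers written for this illustration, not PySpark functions.

```python
def legacy_row(**kwargs):
    """Simulate a Spark < 3 Row built from keyword arguments: fields sorted by name."""
    names = sorted(kwargs)
    return names, tuple(kwargs[name] for name in names)


def pair_by_position(schema_names, values):
    """Pair values with schema fields purely by position, ignoring the Row's own names."""
    return dict(zip(schema_names, values))


row_names, row_values = legacy_row(B=None, A="AAA")
schema_names = ["B", "A"]  # field order of the schema in the question
applied = pair_by_position(schema_names, row_values)

print(row_names)   # ['A', 'B']
print(row_values)  # ('AAA', None)
print(applied)     # {'B': 'AAA', 'A': None}
```

Field B, which expects an array of structs, ends up with the string "AAA", while A receives None.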
For Spark < 3, you need to sort the fields in your schema too, with A before B:
    s = T.StructType([
        T.StructField("A", T.StringType(), True),
        T.StructField("B", subschema, True),
    ])
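Or keep your original schema (B before A) and simply create an RDD from a tuple whose values are in the same order as the schema fields:
    # sc is the SparkContext, e.g. spark.sparkContext
    rdd = sc.parallelize([(None, "AAA")])
    df = spark.createDataFrame(rdd, schema=s)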
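If the records start out as dictionaries, one way to avoid depending on Row ordering in any Spark version is to build plain tuples in the schema's field order first. A minimal sketch: `to_schema_order` is a hypothetical helper, and the hard-coded `field_names` list stands in for the schema's own field names (for example `s.names`).

```python
def to_schema_order(record, field_names):
    """Return the record's values as a tuple ordered like field_names.

    Missing keys become None, which suits nullable fields.
    """
    return tuple(record.get(name) for name in field_names)


field_names = ["B", "A"]  # same order as the schema in the question
records = [{"A": "AAA", "B": None}]
rows = [to_schema_order(r, field_names) for r in records]

print(rows)  # [(None, 'AAA')]
```

With a live session, the resulting list could then be passed as spark.createDataFrame(rows, schema=s).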