Unexpected tuple with StructType - Error in pyspark when using schema to create a data frame

I am trying to do the following:

from pyspark.sql import Row
import pyspark.sql.types as T

# `spark` is an existing SparkSession
subschema = T.ArrayType(T.StructType([
    T.StructField("AA", T.LongType(), True),
    T.StructField("BB", T.StringType(), True),
]), True)
s = T.StructType([
    T.StructField("B", subschema, True),
    T.StructField("A", T.StringType(), True),
])
d = [Row(
    B=None,
    A="AAA",
)]
df = spark.createDataFrame(d, schema=s)

But I am getting an error that does not make sense to me: ValueError: Unexpected tuple 'A' with StructType

If I comment out either field A or field B, the error disappears, but I don't understand why this is happening. What is the problem? Is this a bug, or is there something wrong in my code?

someguy asked Feb 13 '26 11:02


1 Answer

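This happens because, in Spark < 3, a Row created with keyword arguments sorts its fields alphabetically by name. Your Row therefore holds its values in the order (A, B), while your schema declares B first, so Spark tries to apply the type of B to the value of A.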

Spark 3 removed this sorting, so Row fields keep the order in which you pass them. I was able to run your code there without any error.
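To make the mismatch concrete, here is a minimal plain-Python sketch that runs without Spark. The helpers `legacy_row`, `ordered_row`, and `pair_by_position` are hypothetical names that only mimic the behaviour described above. The sketch assumes values are paired with schema fields by position, as this answer describes. It is not PySpark's actual implementation.

```python
# Plain-Python sketch (no Spark required). All helper names are hypothetical
# and only mimic the behaviour described in the answer.

def legacy_row(**kwargs):
    """Mimic Row(**kwargs) in Spark < 3: fields sorted alphabetically by name."""
    names = sorted(kwargs)
    return names, tuple(kwargs[n] for n in names)


def ordered_row(**kwargs):
    """Mimic Row(**kwargs) in Spark 3: fields keep the order they were passed in."""
    names = list(kwargs)
    return names, tuple(kwargs[n] for n in names)


def pair_by_position(schema_fields, values):
    """Pair each schema field with the value found at the same position."""
    return dict(zip(schema_fields, values))


schema_fields = ["B", "A"]  # same field order as the schema in the question

_, legacy_values = legacy_row(B=None, A="AAA")
_, ordered_values = ordered_row(B=None, A="AAA")

legacy_pairing = pair_by_position(schema_fields, legacy_values)
ordered_pairing = pair_by_position(schema_fields, ordered_values)

print(legacy_pairing)   # {'B': 'AAA', 'A': None}
print(ordered_pairing)  # {'B': None, 'A': 'AAA'}
```

In the legacy case the string "AAA" lands on field B, whose type is an array of structs, which is consistent with the error in the question.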

For Spark < 3, you need to sort the fields in your schema alphabetically too, with A before B:

s = T.StructType([
    T.StructField("A", T.StringType(), True),
    T.StructField("B", subschema, True)
])

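Or keep the schema from the question (B first, then A) and simply create the RDD from a tuple whose values follow that order:

# s is the schema from the question (B, then A)
sc = spark.sparkContext
rdd = sc.parallelize([(None, "AAA")])
df = spark.createDataFrame(rdd, schema=s)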
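When rows start out as dictionaries, hand-ordering each tuple can be error-prone. The following is a small sketch of one way to derive the tuples from the schema's field order. `values_in_schema_order` is a hypothetical helper, not part of PySpark, and the field names are hard-coded so the example runs without Spark.

```python
# Hypothetical helper (not part of PySpark): order a record's values so that
# they line up with a schema's field order.

def values_in_schema_order(field_names, record):
    """Return the record's values as a tuple ordered like field_names.

    Fields missing from the record become None.
    """
    return tuple(record.get(name) for name in field_names)


# With PySpark these names could come from s.fieldNames(); they are
# hard-coded here to keep the sketch self-contained.
field_names = ["B", "A"]

records = [{"A": "AAA", "B": None}]
rows = [values_in_schema_order(field_names, r) for r in records]

print(rows)  # [(None, 'AAA')]
```

The resulting list of tuples could then be passed as spark.createDataFrame(rows, schema=s), since the values already follow the schema's field order.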
blackbishop answered Feb 17 '26 00:02