I'm trying to create a schema for my new DataFrame. I've tried various combinations of brackets and keywords, but I haven't been able to figure out how to make this work. My current attempt:
from pyspark.sql.types import *

schema = StructType([
    StructField("User", IntegerType()),
    ArrayType(StructType([
        StructField("user", StringType()),
        StructField("product", StringType()),
        StructField("rating", DoubleType())]))
])
Comes back with the error:
Traceback (most recent call last):
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/types.py", line 290, in __init__
    assert isinstance(elementType, DataType), "elementType should be DataType"
AssertionError: elementType should be DataType
I have googled, but so far I have found no good examples of an array of objects.
Create PySpark ArrayType: you can create an instance of an ArrayType using the ArrayType() class. It takes an elementType argument and one optional argument, containsNull, which specifies whether elements may be null and defaults to True. elementType should be a PySpark type that extends the DataType class.
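For example, a minimal sketch of an array-of-strings type (the variable name here is arbitrary):

from pyspark.sql.types import ArrayType, StringType

# An array whose elements are strings; elements may be null
# (containsNull defaults to True, shown explicitly for clarity)
string_array = ArrayType(StringType(), containsNull=True)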
You will need an additional StructField to wrap the ArrayType, because every top-level entry in a StructType must be a StructField. This one should work:
from pyspark.sql.types import *

schema = StructType([
    StructField("User", IntegerType()),
    StructField("My_array", ArrayType(
        StructType([
            StructField("user", StringType()),
            StructField("product", StringType()),
            StructField("rating", DoubleType())
        ])
    ))
])
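As a quick sanity check, here is a hypothetical usage example (the sample data values are made up; the field names match the schema above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Nested tuples map onto the struct fields of the array elements
data = [(1, [("alice", "book", 4.5), ("alice", "pen", 3.0)])]
df = spark.createDataFrame(data, schema)
df.printSchema()
# root
#  |-- User: integer (nullable = true)
#  |-- My_array: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- user: string (nullable = true)
#  |    |    |-- product: string (nullable = true)
#  |    |    |-- rating: double (nullable = true)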
For more information check this link: http://nadbordrozd.github.io/blog/2016/05/22/one-weird-trick-that-will-fix-your-pyspark-schemas/