Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Creating a Pyspark Schema involving an ArrayType

I'm trying to create a schema for my new DataFrame and have tried various combinations of brackets and keywords but have been unable to figure out how to make this work. My current attempt:

from pyspark.sql.types import *

schema = StructType([
  StructField("User", IntegerType()),
  ArrayType(StructType([
    StructField("user", StringType()),
    StructField("product", StringType()),
    StructField("rating", DoubleType())]))
  ])

Comes back with the error:

elementType should be DataType
Traceback (most recent call last):
 File "/usr/hdp/current/spark2-client/python/pyspark/sql/types.py", line 290, in __init__
assert isinstance(elementType, DataType), "elementType should be DataType"
AssertionError: elementType should be DataType   

I have googled, but so far no good examples of an array of objects.

like image 390
blissweb Avatar asked Jan 23 '18 05:01

blissweb


People also ask

How does PySpark define ArrayType?

Create PySpark ArrayType You can create an instance of an ArrayType using ArraType() class, This takes arguments valueType and one optional argument valueContainsNull to specify if a value can accept null, by default it takes True. valueType should be a PySpark type that extends DataType class.


1 Answers

You will need an additional StructField for ArrayType property. This one should work:

from pyspark.sql.types import *

schema = StructType([
  StructField("User", IntegerType()),
  StructField("My_array", ArrayType(
      StructType([
          StructField("user", StringType()),
          StructField("product", StringType()),
          StructField("rating", DoubleType())
      ])
   )
])

For more information check this link: http://nadbordrozd.github.io/blog/2016/05/22/one-weird-trick-that-will-fix-your-pyspark-schemas/

like image 191
kam Avatar answered Sep 20 '22 20:09

kam