RuntimeError: Unsupported type in conversion to Arrow: VectorUDT

I want to convert a big Spark DataFrame (more than 1,000,000 rows) to Pandas. I tried the following code:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
result.toPandas()

But, I got the error:

TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pyspark/sql/dataframe.py in toPandas(self)
   1949                 import pyarrow
-> 1950                 to_arrow_schema(self.schema)
   1951                 tables = self._collectAsArrow()

/usr/local/lib/python3.6/dist-packages/pyspark/sql/types.py in to_arrow_schema(schema)
   1650     fields = [pa.field(field.name, to_arrow_type(field.dataType), nullable=field.nullable)
-> 1651               for field in schema]
   1652     return pa.schema(fields)

/usr/local/lib/python3.6/dist-packages/pyspark/sql/types.py in <listcomp>(.0)
   1650     fields = [pa.field(field.name, to_arrow_type(field.dataType), nullable=field.nullable)
-> 1651               for field in schema]
   1652     return pa.schema(fields)

/usr/local/lib/python3.6/dist-packages/pyspark/sql/types.py in to_arrow_type(dt)
   1641     else:
-> 1642         raise TypeError("Unsupported type in conversion to Arrow: " + str(dt))
   1643     return arrow_type

TypeError: Unsupported type in conversion to Arrow: VectorUDT

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
<ipython-input-138-4e12457ff4d5> in <module>()
      1 spark.conf.set("spark.sql.execution.arrow.enabled", "true")
----> 2 result.toPandas()

/usr/local/lib/python3.6/dist-packages/pyspark/sql/dataframe.py in toPandas(self)
   1962                     "'spark.sql.execution.arrow.enabled' is set to true. Please set it to false "
   1963                     "to disable this.")
-> 1964                 raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
   1965         else:
   1966             pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)

RuntimeError: Unsupported type in conversion to Arrow: VectorUDT
Note: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to disable this.

With Arrow enabled it fails, and if I set Arrow to false it works, but it's very slow. Any idea?

asked Jul 04 '18 by Saeid SOHEILY KHAH

1 Answer

Arrow supports only a small set of types, and Spark UserDefinedTypes, including the ml and mllib VectorUDT, are not among them.

If you really want to use Arrow, you'll have to convert your data to a format that it supports. One possible solution is to expand the Vectors into separate columns, as shown in How to split Vector into columns - using PySpark.

You can also serialize the output using the to_json function:

from pyspark.sql.functions import to_json

df = df.withColumn("your_vector_column", to_json("your_vector_column"))

But if the data is large enough for toPandas to be a serious bottleneck, I would reconsider collecting it like this in the first place.

answered Nov 20 '22 by Alper t. Turker