I want to convert a big Spark DataFrame (more than 1,000,000 rows) to a Pandas DataFrame. I tried to convert it using the following code:
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
result.toPandas()
But I got the following error:
TypeError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pyspark/sql/dataframe.py in toPandas(self)
1949 import pyarrow
-> 1950 to_arrow_schema(self.schema)
1951 tables = self._collectAsArrow()
/usr/local/lib/python3.6/dist-packages/pyspark/sql/types.py in to_arrow_schema(schema)
1650 fields = [pa.field(field.name, to_arrow_type(field.dataType), nullable=field.nullable)
-> 1651 for field in schema]
1652 return pa.schema(fields)
/usr/local/lib/python3.6/dist-packages/pyspark/sql/types.py in <listcomp>(.0)
1650 fields = [pa.field(field.name, to_arrow_type(field.dataType), nullable=field.nullable)
-> 1651 for field in schema]
1652 return pa.schema(fields)
/usr/local/lib/python3.6/dist-packages/pyspark/sql/types.py in to_arrow_type(dt)
1641 else:
-> 1642 raise TypeError("Unsupported type in conversion to Arrow: " + str(dt))
1643 return arrow_type
TypeError: Unsupported type in conversion to Arrow: VectorUDT
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
<ipython-input-138-4e12457ff4d5> in <module>()
1 spark.conf.set("spark.sql.execution.arrow.enabled", "true")
----> 2 result.toPandas()
/usr/local/lib/python3.6/dist-packages/pyspark/sql/dataframe.py in toPandas(self)
1962 "'spark.sql.execution.arrow.enabled' is set to true. Please set it to false "
1963 "to disable this.")
-> 1964 raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
1965 else:
1966 pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
RuntimeError: Unsupported type in conversion to Arrow: VectorUDT
Note: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to disable this.
So Arrow doesn't work here. If I set spark.sql.execution.arrow.enabled to false, the conversion succeeds, but it is very slow. Any ideas?
Arrow supports only a small set of types, and Spark UserDefinedTypes, including the ml and mllib VectorUDTs, are not among the supported ones.
If you really want to use Arrow, you'll have to convert your data to a format that is supported. One possible solution is to expand the Vectors into columns (see the sketch below, and How to split Vector into columns - using PySpark).
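For example, a minimal sketch, assuming Spark 3.0+ where pyspark.ml.functions.vector_to_array is available; the column name "features", the alias pattern "f%d", and the fixed length n are placeholders for your data:

from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

n = 3  # assumed, fixed length of your vectors

# Replace the vector column with one primitive double column per element;
# primitive columns are Arrow-compatible, so toPandas() can use Arrow again.
df_flat = df.withColumn("arr", vector_to_array(col("features")))
other = [c for c in df.columns if c != "features"]
df_flat = df_flat.select(other + [col("arr")[i].alias("f%d" % i) for i in range(n)])

pdf = df_flat.toPandas()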
You can also serialize the output using the to_json method:
from pyspark.sql.functions import to_json
df.withColumn("your_vector_column", to_json("your_vector_column"))
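If you need the values back on the Pandas side, you can parse the JSON strings after collection. A minimal sketch, reusing the placeholder name your_vector_column from above:

import json
from pyspark.sql.functions import to_json

# Collect with the vector column serialized to JSON strings (Arrow-friendly).
pdf = df.withColumn("your_vector_column", to_json("your_vector_column")).toPandas()

# Parse the JSON strings back into Python dicts on the Pandas side
# (assumption: vectors serialize via VectorUDT's internal struct, with keys
# such as "type" and "values" for dense vectors).
pdf["your_vector_column"] = pdf["your_vector_column"].map(json.loads)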
But if the data is large enough for toPandas to be a serious bottleneck, then I would reconsider collecting it like this at all.