I use PySpark 2.4.0, and I got the following error when I executed this code in the pyspark shell:
$ ./bin/pyspark
Python 2.7.16 (default, Mar 25 2019, 15:07:04)
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/
Using Python version 2.7.16 (default, Mar 25 2019 15:07:04)
SparkSession available as 'spark'.
>>> from pyspark.sql.functions import pandas_udf
>>> from pyspark.sql.functions import pandas_udf, PandasUDFType
>>> from pyspark.sql.types import IntegerType, StringType
>>> slen = pandas_udf(lambda s: s.str.len(), IntegerType())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/x/spark/python/pyspark/sql/functions.py", line 2922, in pandas_udf
return _create_udf(f=f, returnType=return_type, evalType=eval_type)
File "/Users/x/spark/python/pyspark/sql/udf.py", line 47, in _create_udf
require_minimum_pyarrow_version()
File "/Users/x/spark/python/pyspark/sql/utils.py", line 149, in require_minimum_pyarrow_version
"it was not found." % minimum_pyarrow_version)
ImportError: PyArrow >= 0.8.0 must be installed; however, it was not found.
How can I fix it?
The error message in this case is misleading; pyarrow simply wasn't installed for the Python interpreter that Spark was using.
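You can confirm this by trying to import pyarrow with the same interpreter. The output below is just what you would typically see in that case:
$ python -c "import pyarrow"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: No module named pyarrow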
As described in the official Spark SQL Guide (which links to Installing PyArrow), you simply need to execute one of the following commands:
$ conda install -c conda-forge pyarrow
or
$ pip install pyarrow
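After installing pyarrow, the original snippet should work. As a quick sanity check you can run something like the example from the pandas_udf documentation (the sample row here is made up, but the result is what you should see):
>>> from pyspark.sql.functions import pandas_udf
>>> from pyspark.sql.types import IntegerType
>>> slen = pandas_udf(lambda s: s.str.len(), IntegerType())
>>> df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age"))
>>> df.select(slen("name").alias("slen(name)")).show()
+----------+
|slen(name)|
+----------+
|         8|
+----------+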
It is also important to run the command as the proper user and against the proper Python version. For example, if you are running Zeppelin under root with Python 3, you may need to execute
# pip3 install pyarrow
instead.
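If you are unsure which interpreter Spark actually uses, a check along these lines may help (the interpreter path shown is only an example; adjust it to your environment):
$ python3 -c "import sys; print(sys.executable)"
/usr/bin/python3
$ /usr/bin/python3 -m pip install pyarrow
# point Spark at the same interpreter
$ export PYSPARK_PYTHON=/usr/bin/python3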