
How to fix "ImportError: PyArrow >= 0.8.0 must be installed; however, it was not found."?

I use PySpark 2.4.0 and when I executed the following code in pyspark:

$ ./bin/pyspark
Python 2.7.16 (default, Mar 25 2019, 15:07:04)
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Python version 2.7.16 (default, Mar 25 2019 15:07:04)
SparkSession available as 'spark'.
>>> from pyspark.sql.functions import pandas_udf
>>> from pyspark.sql.functions import pandas_udf, PandasUDFType
>>> from pyspark.sql.types import IntegerType, StringType
>>> slen = pandas_udf(lambda s: s.str.len(), IntegerType())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/x/spark/python/pyspark/sql/functions.py", line 2922, in pandas_udf
    return _create_udf(f=f, returnType=return_type, evalType=eval_type)
  File "/Users/x/spark/python/pyspark/sql/udf.py", line 47, in _create_udf
    require_minimum_pyarrow_version()
  File "/Users/x/spark/python/pyspark/sql/utils.py", line 149, in require_minimum_pyarrow_version
    "it was not found." % minimum_pyarrow_version)
ImportError: PyArrow >= 0.8.0 must be installed; however, it was not found.

How to fix it?

Asked Mar 04 '23 by Jacek Laskowski


1 Answer

The error message in this case is misleading: pyarrow simply wasn't installed.

Per the official Spark SQL Guide (which links to PyArrow's installation instructions), you should simply execute one of the following commands:

$ conda install -c conda-forge pyarrow

or

$ pip install pyarrow
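After installing, you can confirm that the interpreter you launch PySpark with actually sees the package. The sketch below mirrors the spirit of Spark's own `require_minimum_pyarrow_version` check from the traceback; `check_pyarrow` is a hypothetical helper written for illustration, not part of Spark's API, and the 0.8.0 floor is taken from the error message above:

```python
import importlib


def check_pyarrow(minimum="0.8.0"):
    """Return a message mirroring Spark's pyarrow presence check."""
    try:
        pyarrow = importlib.import_module("pyarrow")
    except ImportError:
        return ("PyArrow >= %s must be installed; "
                "however, it was not found." % minimum)
    return "Found pyarrow %s" % pyarrow.__version__


print(check_pyarrow())
```

If this prints the "not found" message even after a successful `pip install`, the package most likely went into a different Python environment than the one running this code.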

It is also important to run the command as the proper user and for the proper Python version. For example, if you use Zeppelin as root with Python 3, you may need to execute

# pip3 install pyarrow

instead.
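One way to sidestep the wrong-user/wrong-version trap entirely is to print which interpreter your session actually runs, then invoke pip through that exact binary. A minimal sketch using only the standard library:

```python
import sys

# The absolute path of the Python binary executing this code.
# Installing with "<this path> -m pip install pyarrow" guarantees
# the package lands in the same environment PySpark will import from.
print(sys.executable)
```

Running `"$(that path)" -m pip install pyarrow` then installs pyarrow for the same interpreter, regardless of which `pip`/`pip3` happens to be on the PATH.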

Answered Mar 07 '23 by Jacek Laskowski