
How to fix "ImportError: Pandas >= 0.19.2 must be installed; however, it was not found"?

I'm using Spark 2.3.1 and want to call toPandas() so that I can use unique() on a column.

When I execute the following code in pyspark:

df.toPandas()['column_01'].unique()

I get the following exception:

>>> df.toPandas()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/xxx/spark/python/pyspark/sql/dataframe.py", line 2075, in toPandas
    require_minimum_pandas_version()
  File "/Users/xxx/spark/python/pyspark/sql/utils.py", line 129, in require_minimum_pandas_version
    "it was not found." % minimum_pandas_version)
ImportError: Pandas >= 0.19.2 must be installed; however, it was not found.

How can I fix it?

Abhi asked Oct 27 '25 09:10


2 Answers

You need to install pandas in the Python environment that PySpark uses: pip install pandas. Also, to get the unique values you don't need to convert to a pandas DataFrame at all. You can do that directly on the Spark DataFrame:

df.select('column_01').distinct()
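If the error persists after installing, pandas may have gone into a different interpreter than the one PySpark runs under. Here is a minimal sketch for checking that from inside the same pyspark shell. The helper names version_tuple and check_pandas are made up for illustration, and the minimum version is simply copied from the error message above:

```python
import importlib
import sys

# Taken from the error message in the question; newer Spark releases require more.
MINIMUM_PANDAS_VERSION = "0.19.2"


def version_tuple(version):
    """Turn a string like '0.23.0' into (0, 23, 0), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def check_pandas(minimum=MINIMUM_PANDAS_VERSION):
    """Return 'missing', 'too old' or 'ok' for the interpreter running this code."""
    try:
        pandas = importlib.import_module("pandas")
    except ImportError:
        return "missing"
    if version_tuple(pandas.__version__) < version_tuple(minimum):
        return "too old"
    return "ok"


# The interpreter path shows which environment pip needs to install into.
print("interpreter:", sys.executable)
print("pandas:", check_pandas())
```

If this prints "missing", installing with that exact interpreter (for example by running it with -m pip install pandas) targets the right environment.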

Manoj Singh answered Oct 29 '25 05:10


I know this is an old question, but I recently ran into the same problem when deploying a PySpark job to Google Dataproc. What worked for me was passing the following flag when creating the cluster:

--metadata 'PIP_PACKAGES=pandas==0.23.0'

Christiaan De Villiers answered Oct 29 '25 05:10


