
AWS EMR pandas conflict with numpy in pyspark after bootstrapping

After launching a cluster with the bootstrap code below, I get the stdout shown further down. When I then try to import pandas in pyspark, I get the following error, caused by a conflict with a different numpy version that does not appear anywhere in that stdout. It seems pyspark is ignoring the numpy installation made during bootstrapping and using an old version instead, which causes the conflict. How can I resolve this?

The EMR version I am using is emr-5.33.0.

import pandas as pd
  File "/usr/local/lib64/python3.7/site-packages/pandas/__init__.py", line 22, in <module>
    from pandas.compat import (
  File "/usr/local/lib64/python3.7/site-packages/pandas/compat/__init__.py", line 15, in <module>
    from pandas.compat.numpy import (
  File "/usr/local/lib64/python3.7/site-packages/pandas/compat/numpy/__init__.py", line 21, in <module>
    f"this version of pandas is incompatible with numpy < {_min_numpy_ver}\n"
ImportError: this version of pandas is incompatible with numpy < 1.17.3
your numpy version is 1.16.5.
Please upgrade numpy to >= 1.17.3 to use this pandas version

Here is the bootstrapping code I am using

#!/bin/bash
set -x -e

echo -e 'export PYSPARK_PYTHON=/usr/bin/python3
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_JARS_DIR=/usr/lib/spark/jars
export SPARK_HOME=/usr/lib/spark' >> $HOME/.bashrc && source $HOME/.bashrc

sudo python3 -m pip install
sudo python3 -m pip install numpy pandas awscli boto spark-nlp
sudo python3 -m pip freeze
sudo ls /usr/local/lib64/python3.7/site-packages/


set +x
exit 0

Here is the software config I am giving

[{
  "Classification": "spark-env",
  "Configurations": [{
    "Classification": "export",
    "Properties": {
      "PYSPARK_PYTHON": "/usr/bin/python3"
    }
  }]
},
{
  "Classification": "spark-defaults",
  "Properties": {
    "spark.yarn.stagingDir": "hdfs:///tmp",
    "spark.yarn.preserve.staging.files": "true",
    "spark.kryoserializer.buffer.max": "2000M",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.driver.maxResultSize": "0",
    "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.2"
  }
}
]

Here is the stdout I am getting after bootstrapping

Collecting numpy
  Downloading https://files.pythonhosted.org/packages/2c/d2/8973eb282fc3c7e6c4db0469f0390d81d8eb9ae56dfaa2a7e6db07283682/numpy-1.21.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (14.1MB)
Installing collected packages: numpy
Successfully installed numpy-1.21.0
Collecting pandas
  Downloading https://files.pythonhosted.org/packages/99/f7/01cea7f6c963100f045876eb4aa1817069c5c9eca73d2dbfb5d31ff9a39f/pandas-1.3.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (10.8MB)
Collecting awscli
  Downloading https://files.pythonhosted.org/packages/aa/24/e098cf5ce28a764bca174e88f4ccb70754e9f049c9bf986e582aedcb7420/awscli-1.19.112-py2.py3-none-any.whl (3.6MB)
Requirement already satisfied: boto in /usr/local/lib/python3.7/site-packages
Collecting spark-nlp
  Downloading https://files.pythonhosted.org/packages/6a/98/5e860fdd0227b8eac3907acd5f896c9b2aae0a93cd676aaaf2aa4f48dfe0/spark_nlp-3.1.2-py2.py3-none-any.whl (45kB)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/site-packages (from pandas)
Requirement already satisfied: numpy>=1.17.3 in /root/.local/lib/python3.7/site-packages (from pandas)
Collecting python-dateutil>=2.7.3 (from pandas)
  Downloading https://files.pythonhosted.org/packages/36/7a/87837f39d0296e723bb9b62bbb257d0355c7f6128853c78955f57342a56d/python_dateutil-2.8.2-py2.py3-none-any.whl (247kB)
Collecting rsa<4.8,>=3.1.2; python_version > "2.7" (from awscli)
  Downloading https://files.pythonhosted.org/packages/e9/93/0c0f002031f18b53af7a6166103c02b9c0667be528944137cc954ec921b3/rsa-4.7.2-py3-none-any.whl
Collecting docutils<0.16,>=0.10 (from awscli)
  Downloading https://files.pythonhosted.org/packages/22/cd/a6aa959dca619918ccb55023b4cb151949c64d4d5d55b3f4ffd7eee0c6e8/docutils-0.15.2-py3-none-any.whl (547kB)
Requirement already satisfied: PyYAML<5.5,>=3.10 in /usr/local/lib64/python3.7/site-packages (from awscli)
Collecting s3transfer<0.5.0,>=0.4.0 (from awscli)
  Downloading https://files.pythonhosted.org/packages/63/d0/693477c688348654ddc21dcdce0817653a294aa43f41771084c25e7ff9c7/s3transfer-0.4.2-py2.py3-none-any.whl (79kB)
Collecting colorama<0.4.4,>=0.2.5 (from awscli)
  Downloading https://files.pythonhosted.org/packages/c9/dc/45cdef1b4d119eb96316b3117e6d5708a08029992b2fee2c143c7a0a5cc5/colorama-0.4.3-py2.py3-none-any.whl
Collecting botocore==1.20.112 (from awscli)
  Downloading https://files.pythonhosted.org/packages/c7/ea/11c3beca131920f552602b98d7ba9fc5b46bee6a59cbd48a95a85cbb8f41/botocore-1.20.112-py2.py3-none-any.whl (7.7MB)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas)
Collecting pyasn1>=0.1.3 (from rsa<4.8,>=3.1.2; python_version > "2.7"->awscli)
  Downloading https://files.pythonhosted.org/packages/62/1e/a94a8d635fa3ce4cfc7f506003548d0a2447ae76fd5ca53932970fe3053f/pyasn1-0.4.8-py2.py3-none-any.whl (77kB)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.7/site-packages (from botocore==1.20.112->awscli)
Collecting urllib3<1.27,>=1.25.4 (from botocore==1.20.112->awscli)
  Downloading https://files.pythonhosted.org/packages/5f/64/43575537846896abac0b15c3e5ac678d787a4021e906703f1766bfb8ea11/urllib3-1.26.6-py2.py3-none-any.whl (138kB)
Installing collected packages: python-dateutil, pandas, pyasn1, rsa, docutils, urllib3, botocore, s3transfer, colorama, awscli, spark-nlp
Successfully installed awscli-1.19.112 botocore-1.20.112 colorama-0.4.3 docutils-0.15.2 pandas-1.3.0 pyasn1-0.4.8 python-dateutil-2.8.2 rsa-4.7.2 s3transfer-0.4.2 spark-nlp-3.1.2 urllib3-1.26.6
awscli==1.19.112
beautifulsoup4==4.9.3
boto==2.49.0
botocore==1.20.112
click==7.1.2
colorama==0.4.3
docutils==0.15.2
jmespath==0.10.0
joblib==1.0.1
lxml==4.6.2
mysqlclient==1.4.2
nltk==3.5
nose==1.3.4
numpy==1.21.0
pandas==1.3.0
py-dateutil==2.2
pyasn1==0.4.8
python-dateutil==2.8.2
pytz==2021.1
PyYAML==5.4.1
regex==2021.3.17
rsa==4.7.2
s3transfer==0.4.2
six==1.13.0
spark-nlp==3.1.2
tqdm==4.59.0
urllib3==1.26.6
windmill==1.6
click
click-7.1.2.dist-info
joblib
joblib-1.0.1.dist-info
lxml
lxml-4.6.2-py3.7.egg-info
mysqlclient-1.4.2-py3.7.egg-info
MySQLdb
pandas
pandas-1.3.0.dist-info
PyYAML-5.4.1-py3.7.egg-info
regex
regex-2021.3.17-py3.7.egg-info
tqdm
tqdm-4.59.0.dist-info
yaml
_yaml
Rajarshi Bhadra asked Jul 16 '21 09:07


2 Answers

This issue is actually an EMR bug and is being discussed on the AWS forums here: https://forums.aws.amazon.com/thread.jspa?messageID=989210&tstart=0

I am facing the same issue on EMR 6.3.0; my solution was to pin pandas==1.2.5 in the bootstrap script. This is a quick fix until AWS fixes the issue.

Additionally, a few solutions/hacks were posted in this related question: "How do I have multiple versions of numpy installed on Amazon EMR and how to I delete the early versions?"
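For anyone debugging this, a small check run with the same interpreter that PYSPARK_PYTHON points to (/usr/bin/python3 in the question) can show which numpy actually gets resolved and from where. This is only a sketch: the helper names are made up for illustration, and the 1.17.3 minimum is copied from the error message in the question. It may also be worth noting that the bootstrap log in the question reports numpy as already satisfied in /root/.local/lib/python3.7/site-packages, which may be a location the user running pyspark does not search.

```python
import importlib
import sys

# Minimum numpy version quoted in the ImportError from the question.
MIN_NUMPY_FOR_PANDAS_1_3 = "1.17.3"


def parse_version(text):
    """Turn '1.16.5' into (1, 16, 5); non-numeric suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


def describe_module(name):
    """Return (version, file path) of the module this interpreter resolves, or None."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    version = getattr(module, "__version__", "unknown")
    path = getattr(module, "__file__", "unknown")
    return version, path


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    info = describe_module("numpy")
    if info is None:
        print("numpy is not importable from this interpreter")
    else:
        version, path = info
        print("numpy", version, "loaded from", path)
        print("meets pandas 1.3.0 minimum:",
              meets_minimum(version, MIN_NUMPY_FOR_PANDAS_1_3))
```

If the reported path is not the directory pip installed into during bootstrapping, the newer numpy is simply not visible to that interpreter, and pinning pandas to a release that still accepts the older numpy is one way around it.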

r_g_s_ answered Oct 20 '22 15:10


I faced the same issue. I ran the installation as an EMR step instead of in a bootstrap script, and that worked for me. This may not be appropriate if your workflow depends on EMR cluster state changes, but it should unblock many scenarios where that is not a requirement. More details here
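As a rough illustration of the step-based approach, the sketch below builds a step that runs pip through command-runner.jar and submits it with boto3's add_job_flow_steps. The function names, the step name, the package list, and the placeholder cluster ID and region are assumptions for the example; the exact command used in this answer is not shown in the post.

```python
import shlex


def build_pip_step(packages, name="install-python-packages"):
    """Build an EMR step definition that runs `pip install` via command-runner.jar."""
    command = "sudo python3 -m pip install " + " ".join(
        shlex.quote(package) for package in packages
    )
    return {
        "Name": name,  # hypothetical step name
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["bash", "-c", command],
        },
    }


def submit_step(cluster_id, step, region_name):
    """Submit the step to a running cluster; requires boto3 and AWS credentials."""
    import boto3  # imported here so build_pip_step works without boto3 installed

    emr = boto3.client("emr", region_name=region_name)
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"]


if __name__ == "__main__":
    step = build_pip_step(["numpy", "pandas"])
    print(step["HadoopJarStep"]["Args"])
    # To actually submit, fill in your own cluster ID and region:
    # submit_step("j-XXXXXXXXXXXXX", step, "us-east-1")
```

Keep in mind that a step runs on the master node only, so this changes the driver's environment; Python workers on core and task nodes are not touched by it.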

vecktorking answered Oct 20 '22 16:10
