 

Why does spark-submit in YARN cluster mode not find python packages on executors?

I am running a boo.py script on AWS EMR using spark-submit (Spark 2.0).

The script finishes successfully when I run

python boo.py

However, it fails when I run

spark-submit --verbose --deploy-mode cluster --master yarn boo.py

The log retrieved with yarn logs -applicationId ID_number shows:

Traceback (most recent call last):
  File "boo.py", line 17, in <module>
    import boto3
ImportError: No module named boto3

The Python interpreter and boto3 module I am using are:

$ which python
/usr/bin/python
$ pip install boto3
Requirement already satisfied (use --upgrade to upgrade): boto3 in /usr/local/lib/python2.7/site-packages

How do I add this library path so that spark-submit can find the boto3 module?

asked Oct 19 '22 by Frank The Tank

1 Answer

When you run a Spark application, part of the code runs on the driver and part runs on the executors.

Did you install boto3 only on the driver, or on the driver and on all executor nodes that might run your code?
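In yarn cluster mode the ImportError is raised on an executor (or on the cluster-side driver), not on the machine where you typed the command. Here is a minimal sketch of why; the job and function names are hypothetical, and it assumes the nodes have AWS credentials (e.g. an EMR instance role):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("boto3-check").getOrCreate()

def list_buckets(_):
    # This import runs on whichever executor processes the partition,
    # so boto3 must be installed on that node, not just on the machine
    # where spark-submit was launched.
    import boto3
    s3 = boto3.client("s3")
    return [b["Name"] for b in s3.list_buckets()["Buckets"]]

# Each partition lands on some executor; any node without boto3 raises
# ImportError: No module named boto3
print(spark.sparkContext.parallelize(range(2), 2).flatMap(list_buckets).collect())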

One solution is to install boto3 on all executor nodes.

This question covers how to install Python modules on Amazon EMR nodes:

How to bootstrap installation of Python modules on Amazon EMR?
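As a rough sketch, an EMR bootstrap action is just a shell script that runs on every node (master, core, and task) at cluster startup; the script name and S3 path below are placeholders:

#!/bin/bash
# install-boto3.sh - runs on every node when the cluster starts
sudo pip install boto3

Upload the script to S3 and reference it when creating the cluster, for example:

aws emr create-cluster ... --bootstrap-actions Path=s3://your-bucket/install-boto3.sh

Shipping dependencies with spark-submit --py-files is another common route, but boto3/botocore read bundled data files from disk and may not work when imported from a zip, so a per-node install is the safer option here.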

answered Nov 15 '22 by Yaron