 

Shipping and using virtualenv in a pyspark job

PROBLEM: I am attempting to run a spark-submit script from my local machine to a cluster of machines. The work done by the cluster uses numpy. I currently get the following error:

ImportError: 
Importing the multiarray numpy extension module failed.  Most
likely you are trying to import a failed build of numpy.
If you're working with a numpy git repo, try `git clean -xdf` (removes all
files not under version control).  Otherwise reinstall numpy.

Original error was: cannot import name multiarray

DETAIL: In my local environment I have set up a virtualenv that includes numpy, a private repo I use in my project, and various other libraries. I created a zip file (lib/libs.zip) from the site-packages directory at venv/lib/site-packages, where 'venv' is my virtual environment, and I ship this zip to the remote nodes. My shell script for performing the spark-submit looks like this:

$SPARK_HOME/bin/spark-submit \
  --deploy-mode cluster \
  --master yarn \
  --conf spark.pyspark.virtualenv.enabled=true  \
  --conf spark.pyspark.virtualenv.type=native \
  --conf spark.pyspark.virtualenv.requirements=${parent}/requirements.txt \
  --conf spark.pyspark.virtualenv.bin.path=${parent}/venv \
  --py-files "${parent}/lib/libs.zip" \
  --num-executors 1 \
  --executor-cores 2 \
  --executor-memory 2G \
  --driver-memory 2G \
  $parent/src/features/pi.py
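
(For reference, libs.zip was built by zipping the contents of the virtualenv's site-packages, roughly like the sketch below; the exact site-packages path is whatever your virtualenv layout uses, e.g. it may be venv/lib/python2.7/site-packages on some installs.)

# zip everything inside the virtualenv's site-packages into lib/libs.zip
cd venv/lib/site-packages && zip -r "${parent}/lib/libs.zip" .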

I also know that the remote nodes have a Python 2.7 install at /usr/local/bin/python2.7.

So in my conf/spark-env.sh I have set the following:

export PYSPARK_PYTHON=/usr/local/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python2.7

When I run the script I get the error above. If I print the installed_distributions I get an empty list []. Also, my private library imports correctly (which tells me it is actually accessing the site-packages in my libs.zip). My pi.py file looks something like this:

from myprivatelibrary.bigData.spark import spark_context
spark = spark_context()
import numpy as np
spark.parallelize(range(1, 10)).map(lambda x: np.__version__).collect()

EXPECTATION/MY THOUGHTS: I expect this to import numpy correctly, especially since I know numpy works correctly in my local virtualenv. I suspect this happens because I'm not actually using the version of Python that is installed in my virtualenv on the remote node. My questions: first, how do I fix this, and second, how do I use the Python installed in my virtualenv on the remote nodes instead of the Python that is just manually installed and currently sitting on those machines? I've seen some write-ups on this but frankly they are not well written.

Asked Sep 07 '17 by AntiPawn79
1 Answer

With the --conf spark.pyspark.* options and export PYSPARK_PYTHON=/usr/local/bin/python2.7 you are setting options for your local environment / the driver. To set options for the cluster (the executors), use the following syntax:

--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON
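
For example (a sketch only; the interpreter path must actually exist on the cluster nodes, and spark.executorEnv.PYSPARK_PYTHON is included on the assumption that you also want to pin the executor-side interpreter explicitly):

$SPARK_HOME/bin/spark-submit \
  --deploy-mode cluster \
  --master yarn \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/local/bin/python2.7 \
  --conf spark.executorEnv.PYSPARK_PYTHON=/usr/local/bin/python2.7 \
  ...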

Furthermore, I guess you should make your virtualenv relocatable (this is experimental, however). <edit 20170908> This means that the virtualenv uses relative instead of absolute links. </edit>
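
For illustration, virtualenv has a flag for this (experimental, and deprecated in newer virtualenv releases, so treat this as a sketch):

# rewrite scripts and links inside the env to use relative paths
virtualenv --relocatable venv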

What we did in such cases: we shipped an entire Anaconda distribution over HDFS.
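
Roughly, that means uploading the zipped distribution to HDFS once and pointing spark-submit at the hdfs:// path; the paths below are hypothetical:

# upload the zipped distribution once (destination path is hypothetical)
hdfs dfs -put anaconda.zip /user/me/envs/anaconda.zip
# it can then be referenced as hdfs:///user/me/envs/anaconda.zip in --archives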

<edit 20170908>

If we are talking about different environments (macOS vs. Linux, as mentioned in the comment below), you cannot just submit a virtualenv, at least not if your virtualenv contains packages with binaries (as is the case with numpy). In that case I suggest you build yourself a 'portable' Anaconda, i.e. install Anaconda in a Linux VM and zip it up.
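
A sketch of building such a 'portable' Anaconda inside the Linux VM (the environment name, package list and install location are assumptions):

# inside the Linux VM: create a conda env with the packages the job needs
conda create -y -n portable_env python=2.7 numpy
# zip the env directory so it can be shipped to the cluster with --archives
cd ~/anaconda2/envs && zip -r ~/anaconda.zip portable_env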

Regarding --archives vs. --py-files:

  • --py-files adds python files/packages to the python path. From the spark-submit documentation:

    For Python applications, simply pass a .py file in the place of <application-jar> instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.

  • --archives means the archives are extracted into the working directory of each executor (YARN clusters only).

However, a crystal-clear distinction is lacking, in my opinion - see for example this SO post.

In the given case, add the anaconda.zip via --archives, and your 'other python files' via --py-files.
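
Putting that together, a hedged example: the '#anaconda' suffix names the directory the archive is unpacked into on each executor, and the interpreter path inside the zip (portable_env/bin/python) is an assumption about how the archive was built:

$SPARK_HOME/bin/spark-submit \
  --deploy-mode cluster \
  --master yarn \
  --archives hdfs:///user/me/envs/anaconda.zip#anaconda \
  --py-files "${parent}/lib/libs.zip" \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./anaconda/portable_env/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=./anaconda/portable_env/bin/python \
  $parent/src/features/pi.py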

</edit>

See also: Running Pyspark with Virtualenv, a blog post by Henning Kropp.

Answered Oct 13 '22 by akoeltringer