 

Spark problems with imports in Python

We are running a spark-submit command on a Python script that uses Spark to parallelize object detection with Caffe. The script runs perfectly fine as plain Python, but it raises an import error when combined with the Spark code. I know the Spark code is not the problem, because it works perfectly on my home machine; it only fails on AWS. I am not sure whether this has something to do with the environment variables; it is as if Spark doesn't detect them.
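The failing pattern is essentially this (a minimal sketch; detect_objects and the image list are placeholders, not our real code):

    from pyspark import SparkContext

    def detect_objects(image_path):
        # caffe is imported inside the function, so cloudpickle ships the
        # import to the executors; this is the import that fails on AWS
        import caffe
        # ... load a net and run detection here; details omitted ...
        return image_path

    sc = SparkContext(appName="caffe-detection")
    paths = sc.parallelize(["img1.jpg", "img2.jpg"])
    results = paths.map(detect_objects).collect()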

These environment variables are set:

SPARK_HOME=/opt/spark/spark-2.0.0-bin-hadoop2.7
PATH=$SPARK_HOME/bin:$PATH
PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
PYTHONPATH=/opt/caffe/python:${PYTHONPATH}
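For what it's worth, these are set on the driver machine. As a sketch (untested on our cluster), we could also hand the path to the executors explicitly through SparkConf instead of relying on inherited shell variables:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("caffe-detection")
            # spark.executorEnv.<name> sets an environment variable in each
            # executor process; /opt/caffe/python is our install path from above
            .set("spark.executorEnv.PYTHONPATH", "/opt/caffe/python"))
    sc = SparkContext(conf=conf)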

Error:

16/10/03 01:36:21 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 172.31.50.167): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
 File "/opt/spark/spark-2.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 161, in main
   func, profiler, deserializer, serializer = read_command(pickleSer, infile)
 File "/opt/spark/spark-2.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 54, in read_command
   command = serializer._read_with_length(file)
 File "/opt/spark/spark-2.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
   return self.loads(obj)
 File "/opt/spark/spark-2.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
   return pickle.loads(obj)
 File "/opt/spark/spark-2.0.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 664, in subimport
   __import__(name)
ImportError: ('No module named caffe', <function subimport at 0x7efc34a68b90>, ('caffe',))

Does anyone know why this would be an issue?

This package from Yahoo does what we're trying to do: it ships Caffe to the workers as a jar dependency and then uses it from Python. But I haven't found any resources on how to build it and import it ourselves.

https://github.com/yahoo/CaffeOnSpark

asked Oct 03 '16 by alfredox


1 Answer

You probably haven't compiled the Caffe Python wrappers (pycaffe) in your AWS environment. For reasons that completely escape me (and several others: https://github.com/BVLC/caffe/issues/2440), pycaffe is not available as a PyPI package, so you have to compile it yourself. Follow the compilation/make instructions at http://caffe.berkeleyvision.org/installation.html#python, or automate them with ebextensions if you are in an AWS Elastic Beanstalk environment.
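Once pycaffe is built, a quick sanity check (my own sketch, not part of the official instructions) is to run the import on the executors themselves and see what comes back:

    from pyspark import SparkContext

    def probe(_):
        # attempt the same import the worker performs and report the outcome
        try:
            import caffe
            return "ok: " + caffe.__file__
        except ImportError as exc:
            return "missing: " + str(exc)

    sc = SparkContext(appName="caffe-probe")
    # one task per partition so several executors get exercised
    print(sc.parallelize(range(4), 4).map(probe).collect())

If any partition reports "missing", that executor's Python still can't see the compiled module, and the PYTHONPATH on that machine is the next thing to check.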

answered Nov 14 '22 by 2ps