pyspark import user defined module or .py files

I built a python module and I want to import it in my pyspark application.

My package directory structure is:

wesam/
|-- data.py
`-- __init__.py
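(For illustration, a stand-in for the contents -- the load helper below is hypothetical and only makes the import path concrete:)

# wesam/__init__.py -- often just an empty file; it marks wesam as a package

# wesam/data.py -- hypothetical example contents
def load(path):
    # trivial placeholder used in the sketches further down
    with open(path) as f:
        return f.read()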

A simple import wesam at the top of my pyspark script leads to ImportError: No module named wesam. I also tried zipping it and shipping it with my code using --py-files, as recommended in this answer, with no luck.

./bin/spark-submit --py-files wesam.zip mycode.py 

I also added the file programmatically as suggested by this answer, but I got the same ImportError: No module named wesam error.

sc.addPyFile("wesam.zip")
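(For reference, roughly how that attempt looks in my driver script -- a simplified sketch, the app name is arbitrary:)

from pyspark import SparkContext

sc = SparkContext(appName="myapp")   # app name is just an example
sc.addPyFile("wesam.zip")            # ship the archive to the executors
import wesam                         # attempted only after addPyFile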

What am I missing here?

Asked by Sam on Apr 21 '17



2 Answers

It turned out that because I'm submitting my application in client mode, the machine I run the spark-submit command from runs the driver program, so it needs access to the module files.


I added my module to the PYTHONPATH environment variable on the node I submit my job from, by adding the following line to my .bashrc file (or executing it before submitting the job):

export PYTHONPATH=$PYTHONPATH:/home/welshamy/modules 

And that solved the problem. Since the path is on the driver node, I don't have to zip and ship the module with --py-files or use sc.addPyFile().
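An equivalent workaround, if you prefer not to touch .bashrc, is to extend sys.path at the top of the driver script before importing the module -- a minimal sketch, using the same path as above:

import sys

# make the directory that contains the wesam/ package visible to the driver process
sys.path.append("/home/welshamy/modules")
import wesam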

The key to solving any pyspark module import error is understanding whether the driver node, the worker nodes, or both need the module files.

Important: If the worker nodes need your module files, you must pass them as a zip archive with --py-files, and this argument must precede your .py file argument. Notice the order of arguments in these examples:

This is correct:

./bin/spark-submit --py-files wesam.zip mycode.py 

This is not correct:

./bin/spark-submit mycode.py --py-files wesam.zip 
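To illustrate why the workers need the archive at all: if the module is used inside a transformation, that function runs on the executors, so they must be able to import it too. A rough sketch (data.load is the hypothetical helper sketched in the question; wesam.zip is assumed to contain the wesam/ directory at its root):

from pyspark import SparkContext

sc = SparkContext(appName="wesam-demo")     # app name is illustrative
sc.addPyFile("wesam.zip")                   # or submit with --py-files wesam.zip
from wesam import data                      # driver-side import; in client mode the driver also needs the module, e.g. via PYTHONPATH as above

paths = sc.parallelize(["a.txt", "b.txt"])  # example input
# the lambda below runs on the worker nodes, so the executors need the
# wesam package too -- that is what --py-files / addPyFile provides
contents = paths.map(lambda p: data.load(p)).collect()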
Answered by Sam on Sep 22 '22


Put mycode.py and wesam.py in the same directory and try

sc.addPyFile("wesam.py")

It might work.
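That is, something along these lines (a sketch; the import goes after addPyFile):

sc.addPyFile("wesam.py")   # wesam.py sits next to mycode.py
import wesam               # import only after the file has been added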

Answered by Dj.OU on Sep 23 '22