
Using spark-submit with python main

Reading this and this makes me think it is possible to have a Python file executed by spark-submit, but I couldn't get it to work.

My setup is a bit complicated: I need several different jars to be submitted together with my Python files for everything to function. The pyspark command that works for me is the following:

IPYTHON=1 ./pyspark --jars jar1.jar,/home/local/ANT/bogoyche/dev/rhine_workspace/env/Scala210-1.0/runtime/Scala2.10/scala-library.jar,jar2.jar --driver-class-path jar1.jar:jar2.jar

and then, inside the IPython shell:

from sys import path
path.append('my_module')  # directory containing my code
from my_module import myfn
myfn(myargs)

I have packaged my Python files inside an egg. The egg contains the main file, which makes the egg executable by calling python myegg.egg.
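
The egg runs that way because it carries a __main__.py at its top level; roughly like this (module and function names are the placeholders from above):

# __main__.py at the root of the egg; names are placeholders
import sys

from my_module import myfn

if __name__ == '__main__':
    # hand the command-line arguments to the real entry point
    myfn(sys.argv[1:])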

I am now trying to form my spark-submit command, and I can't seem to get it right. Here's where I am:

./spark-submit --jars jar1.jar,jar2.jar --py-files path/to/my/egg.egg arg1 arg
Error: Cannot load main class from JAR file:/path/to/pyspark/directory/arg1
Run with --help for usage help or --verbose for debug output

Instead of executing my .egg file, spark-submit takes the first argument after the egg, treats it as a jar file, and tries to load a main class from it. What am I doing wrong?

Asked Jun 30 '16 by XapaJIaMnu


1 Answer

One way is to have a main driver program for your Spark application as a Python file (.py) that gets passed to spark-submit. This primary script holds the main method that lets the driver identify the entry point; it is also where you customize configuration properties and initialize the SparkContext.

The modules bundled in the egg are dependencies: they are shipped to the executor nodes and imported inside the driver program.
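
A minimal sketch of what such an egg-side function might look like (myfn and its signature are assumptions based on the question):

# my_module.py, bundled in the egg and shipped via --py-files
# (sketch; names and signature assumed)
def myfn(args, sc):
    # use the SparkContext handed in by the driver program
    rdd = sc.parallelize(args)
    print(rdd.count())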

You can write a small driver file and submit it like this:

./spark-submit --jars jar1.jar,jar2.jar --py-files path/to/my/egg.egg driver.py arg1 arg

Because driver.py is now the primary resource, spark-submit passes arg1 and arg through to it instead of trying to load a main class from them. The driver program would be something like this:

import sys

from pyspark import SparkContext, SparkConf
from my_module import myfn

if __name__ == '__main__':
    # initialize the SparkContext in the primary script
    conf = SparkConf().setAppName("app")
    sc = SparkContext(conf=conf)
    # forward the command-line arguments (arg1, arg) to the egg's entry point
    myfn(sys.argv[1:], sc)

Pass the SparkContext object as an argument wherever it is needed.
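
For completeness, one common way to build such an egg is with setuptools (a sketch; the package name is a placeholder):

# setup.py (sketch; package name assumed)
from setuptools import setup, find_packages

setup(
    name='my_module',
    version='0.1',
    packages=find_packages(),
)

Running python setup.py bdist_egg produces the .egg under dist/, which is what you pass to --py-files.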

Answered Sep 22 '22 by Shantanu Alshi