
How to pass a Python package to a Spark job and invoke the main file from the package with arguments

I have my python code with a structure like,

Project1
--src
----util.py
----job1.py
----job2.py
--config
----config1.json
----config2.json

I want to run job1 in Spark, but I cannot invoke job1.py on its own because it depends on other files like util.py and job2.py, plus the config files, so I need to pass the complete package as an input to Spark.

I tried running spark-submit job1.py, but it fails because dependencies like job2.py and util.py are not available to the executors.

Based on the Spark documentation, --files is one option, but it works by passing every filename to spark-submit, which becomes unwieldy as the number of files in the codebase grows.

Another option I see is passing a zip of the code with the --archives option, but it still fails because the files inside the zip cannot be referenced.

So can anyone suggest another way to run such a codebase in Spark?

asked Dec 20 '17 by user3347819



1 Answer

There are a few basic steps:

  • Create a Python package.
  • Either build an egg file or create a simple zip archive.
  • Add the package as a dependency using --py-files / pyFiles.
  • Create a thin main.py which invokes functions from the package, and submit it to the Spark cluster.
answered Sep 30 '22 by Alper t. Turker