I have my Python code with a structure like:
Project1
--src
----util.py
----job1.py
----job2.py
--config
----config1.json
----config2.json
I want to run job1 in Spark, but I cannot simply invoke job1.py because it depends on other files like util.py and job2.py as well as the config files, so I need to pass the complete package as input to Spark.
I tried running spark-submit job1.py, but it fails because dependencies like job2.py and util.py are not available to the executors.
Based on the Spark documentation, I see that --files is an option, but it works by passing every filename to spark-submit, which looks difficult as the number of files in the codebase grows.
Another option I see is passing a zip of the code with the --archives option, but it still fails because the files inside the zip cannot be referenced.
So can anyone suggest any other way to run such a codebase in Spark?
Place the files that you want to include inside the package directory (in this case, the config JSON files would need to live inside the package). Add include_package_data=True in setup.py. Add package_data={'': [...patterns for the files you want to include, relative to the package dir...]} in setup.py.
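For concreteness, here is a minimal setup.py sketch along those lines; the package name project1 and the JSON glob are assumptions, not something given in the question:

```python
# setup.py -- minimal sketch; "project1" and the "*.json" pattern are assumed,
# adjust them to the real package name and config layout. This only picks up
# JSON files that live inside the package directory itself.
from setuptools import setup, find_packages

setup(
    name="project1",
    version="0.1.0",
    packages=find_packages(),       # finds src/ if it contains an __init__.py
    include_package_data=True,      # include the non-Python files declared below
    package_data={"": ["*.json"]},  # ship the JSON config files with the package
)
```

Running python setup.py bdist_egg (or building a plain zip of the package) then produces the archive you hand to Spark.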
A package is a hierarchical directory structure that defines a single Python application environment, consisting of modules, subpackages, sub-subpackages, and so on. Once you add the relevant imports to __init__.py, all of those modules are available when you import the package.
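For this project that just means dropping an __init__.py into src/ (and into any subpackage); the re-exports below are optional and purely illustrative:

```python
# src/__init__.py -- its presence makes src/ an importable package.
# The re-exports are optional; they let callers write `from src import util, job2`.
from . import util   # noqa: F401
from . import job2   # noqa: F401
```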
There are a few basic steps:
- Build an egg file or create a simple zip archive of the package.
- Add the archive as a dependency using --py-files / pyFiles.
- Create a thin main.py which invokes functions from the package, and submit it to the Spark cluster (a rough sketch of this is shown below).
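Putting those steps together, here is a hedged sketch; the names src.zip, main.py, and job1.run() are assumptions about the layout, not something given in the question:

```python
# main.py -- thin driver kept outside the archive; submitted with something like:
#   spark-submit --py-files src.zip --files config/config1.json main.py
# where src.zip is the zipped src/ package (e.g. zip -r src.zip src/ from Project1/).
from pyspark import SparkFiles
from pyspark.sql import SparkSession

from src import job1  # importable because src.zip is put on the PYTHONPATH via --py-files


def main():
    spark = SparkSession.builder.appName("job1").getOrCreate()

    # config1.json was distributed with --files; SparkFiles.get resolves its local path.
    config_path = SparkFiles.get("config1.json")

    # job1.run() is a hypothetical entry point; call whatever function job1.py actually exposes.
    job1.run(spark, config_path)

    spark.stop()


if __name__ == "__main__":
    main()
```

Because the whole package travels as one archive, new modules under src/ are picked up automatically; only genuinely new top-level artifacts (extra config files, the archive itself) ever need to be added to the spark-submit line.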