I have my Python code with a structure like:
Project1
--src
----util.py
----job1.py
----job2.py
--config
----config1.json
----config2.json
I want to run job1 in Spark, but I cannot simply invoke job1.py because it depends on other files like util.py and job2.py as well as the config files, so I need to pass the complete package as input to Spark.
I tried running spark-submit job1.py, but it fails because dependencies like job2.py and util.py are not available to the executors.
Based on the Spark documentation, I see that --files is an option, but it works by passing every filename to spark-submit, which looks difficult as the number of files in the codebase grows.
Another option I see is passing a zip of the code with the --archives option, but it still fails because the files inside the zip cannot be referenced.
So can anyone suggest any other way to run such a codebase in Spark?
Place the files that you want to include inside the package directory (in this case, the config JSON files would need to live inside the package). Add include_package_data=True in setup.py. Add package_data={'': [...patterns for the files you want to include, relative to the package dir...]} in setup.py.
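For concreteness, here is a minimal setup.py sketch along those lines; the package name project1 and the JSON glob are assumptions, not something given in the question:

```python
# setup.py -- minimal sketch; "project1" and the "*.json" pattern are assumed,
# adjust them to the real package name and config layout. This only picks up
# JSON files that live inside the package directory itself.
from setuptools import setup, find_packages

setup(
    name="project1",
    version="0.1.0",
    packages=find_packages(),       # finds src/ if it contains an __init__.py
    include_package_data=True,      # include the non-Python files declared below
    package_data={"": ["*.json"]},  # ship the JSON config files with the package
)
```

Running python setup.py bdist_egg (or building a plain zip of the package) then produces the archive you hand to Spark.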
A package is a hierarchical directory structure that defines a single Python application environment, consisting of modules, subpackages, sub-subpackages, and so on. Once you add the relevant imports to __init__.py, all of those modules are available when you import the package.
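For this project that just means dropping an __init__.py into src/ (and into any subpackage); the re-exports below are optional and purely illustrative:

```python
# src/__init__.py -- its presence makes src/ an importable package.
# The re-exports are optional; they let callers write `from src import util, job2`.
from . import util   # noqa: F401
from . import job2   # noqa: F401
```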
There are a few basic steps:
- Build an egg file or create a simple zip archive of the package.
- Add the archive as a dependency using --py-files / pyFiles.
- Create a thin main.py which invokes functions from the package, and submit it to the Spark cluster (a rough sketch of this is shown below).
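Putting those steps together, here is a hedged sketch; the names src.zip, main.py, and job1.run() are assumptions about the layout, not something given in the question:

```python
# main.py -- thin driver kept outside the archive; submitted with something like:
#   spark-submit --py-files src.zip --files config/config1.json main.py
# where src.zip is the zipped src/ package (e.g. zip -r src.zip src/ from Project1/).
from pyspark import SparkFiles
from pyspark.sql import SparkSession

from src import job1  # importable because src.zip is put on the PYTHONPATH via --py-files


def main():
    spark = SparkSession.builder.appName("job1").getOrCreate()

    # config1.json was distributed with --files; SparkFiles.get resolves its local path.
    config_path = SparkFiles.get("config1.json")

    # job1.run() is a hypothetical entry point; call whatever function job1.py actually exposes.
    job1.run(spark, config_path)

    spark.stop()


if __name__ == "__main__":
    main()
```

Because the whole package travels as one archive, new modules under src/ are picked up automatically; only genuinely new top-level artifacts (extra config files, the archive itself) ever need to be added to the spark-submit line.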