I would like to initialize config once, and then use it in many modules of my PySpark project.
I see 2 ways to do it.
main.py:
with open(sys.argv[1]) as f:
config = json.load(f)
df = load_df(config)
df = parse(df, config)
df = validate(df, config, strict=True)
dump(df, config)
But it seems unbeauty to pass one external argument to each function.
config.py
import sys
import json
with open(sys.argv[1]) as f:
config = json.load(f)
main.py
from config import config
df = load_df()
df = parse(df)
df = validate(df, strict=True)
dump(df)
and in each module add row
from config import config
It seems more beauty because config is not, strictly speaking, an argument of function. It is general context where they execute.
Unfortunately, PySpark pickle config.py and tries to execute it on server, but doesn't pass sys.argv to them! So, I see error when run it
File "/PycharmProjects/spark_test/config.py", line 6, in <module>
CONFIG_PATH = sys.argv[1]
IndexError: list index out of range
What is the best practice to work with general config, loaded from file, in PySpark?
Your program starts execution on master and passes main bulk of its work to executors by invoking some functions on them. The executors are different processes that are typically run on different physical machines.
Thus anything that the master would want to reference on executors needs to be either a standard library function (to which the executors have access) or a pickelable object that can be sent over.
You typically don't want to load and parse any external resources on the executors, since you would always have to copy them over and make sure you load them properly... Passing a pickelable object as an argument of the function (e.g. for a UDF) works much better, since there is only one place in your code where you need to load it.
I would suggest creating a config.py
file and add it as an argument to your spark-submit
command:
spark-submit --py-files /path/to/config.py main_program.py
Then you can create spark context like this:
spark_context = SparkContext(pyFiles=['/path/to/config.py'])
and simply use import config
wherever you need.
You can even include whole python packages in a tree packaged as a single zip file instead of just a single config.py
file, but then be sure to include __init__.py
in every folder that needs to be referenced as python module.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With