What is the best PySpark practice to load config from external file

I would like to initialize the config once and then use it in many modules of my PySpark project.

I see 2 ways to do it.

  1. Load it in the entry point and pass it as an argument to each function

main.py:

import sys
import json

# load_df, parse, validate and dump are defined in the project's own modules
with open(sys.argv[1]) as f:
    config = json.load(f)
df = load_df(config)
df = parse(df, config)
df = validate(df, config, strict=True)
dump(df, config)

But it seems inelegant to pass this one external argument to every function.

  2. Load the config in config.py and import this object in each module

config.py:

import sys
import json
with open(sys.argv[1]) as f:
    config = json.load(f)

main.py:

from config import config
df = load_df()
df = parse(df)
df = validate(df, strict=True)
dump(df)

and in each module add the line

from config import config

This seems more elegant because, strictly speaking, the config is not an argument of the functions; it is the general context in which they execute.

Unfortunately, PySpark pickles config.py and tries to execute it on the executors, but doesn't pass sys.argv to them! So I see this error when I run it:

  File "/PycharmProjects/spark_test/config.py", line 6, in <module>
    CONFIG_PATH = sys.argv[1]
IndexError: list index out of range

What is the best practice for working with a general config, loaded from a file, in PySpark?

asked Sep 25 '17 by vladimirfol


1 Answer

Your program starts execution on the master and passes the main bulk of its work to the executors by invoking functions on them. The executors are separate processes that typically run on different physical machines.

Thus anything that the master wants to reference on the executors needs to be either a standard library function (to which the executors already have access) or a picklable object that can be sent over.

You typically don't want to load and parse any external resources on the executors, since you would always have to copy them over and make sure you load them properly... Passing a picklable object as an argument to a function (e.g. for a UDF) works much better, since there is only one place in your code where you need to load it.
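
For example, here is a minimal sketch of that approach. The config keys ("prefix", "input_path") and the column name "value" are made-up placeholders for illustration, not something from the question:

import json
import sys

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Parse the config once, on the driver. A plain dict is picklable,
# so it can be shipped to the executors without any extra work.
with open(sys.argv[1]) as f:
    config = json.load(f)

# The dict is captured in the UDF's closure and pickled together with it.
@udf(returnType=StringType())
def add_prefix(value):
    return config.get("prefix", "") + str(value)

df = spark.read.json(config["input_path"])
df = df.withColumn("prefixed", add_prefix(df["value"]))

If the config gets large, you can also broadcast it once with spark.sparkContext.broadcast(config) and read it through .value inside the function, so it is not re-shipped with every task.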

I would suggest creating a config.py file and adding it as an argument to your spark-submit command:

spark-submit --py-files /path/to/config.py main_program.py

Then you can create the Spark context like this:

spark_context = SparkContext(pyFiles=['/path/to/config.py'])

and simply use import config wherever you need it.
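
A config.py shipped this way has to be importable without any command-line arguments, so instead of reading sys.argv it can just hold plain values. A minimal sketch (the names are placeholders, not from the question):

config.py:

# Must import cleanly on the executors, i.e. without sys.argv,
# so the settings are plain module-level constants.
INPUT_PATH = "hdfs:///data/input"
STRICT_VALIDATION = True

and in any module, on the driver or on an executor:

import config

def load_df(spark):
    return spark.read.json(config.INPUT_PATH)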

You can even include whole Python packages as a tree packaged into a single zip file instead of just a single config.py, but then be sure to include __init__.py in every folder that needs to be referenced as a Python module.
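
For example, assuming your code lives in a package directory mypackage/ (a placeholder name) with __init__.py in every subfolder:

zip -r deps.zip mypackage/
spark-submit --py-files deps.zip main_program.py

after which import mypackage.config works on both the driver and the executors.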

answered Sep 23 '22 by grepe