I am sending a Spark job to run on a remote cluster by running
spark-submit ... --deploy-mode cluster --files some.properties ...
I want to read the content of the some.properties file in the driver code, i.e. before creating the Spark context and launching RDD tasks. The file is copied to the remote driver, but not to the driver's working directory.
The workarounds I know of are inconvenient, since this file is changed frequently on the submitting dev machine.
Is there a way, in the driver code's main method, to read the file that was uploaded using the --files flag?
When you run spark-submit, a driver program is launched; it requests resources from the cluster manager and runs the main method of your application. Spark translates the RDD transformations into a DAG (Directed Acyclic Graph): at a high level, nothing executes until an action is called on an RDD, at which point Spark builds the DAG and submits it to the DAG scheduler.
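As a rough sketch of that laziness (the object and value names here are made up for illustration), transformations such as map only record lineage, and nothing is submitted to the DAG scheduler until an action such as count runs:

import org.apache.spark.{SparkConf, SparkContext}

object DagExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dag-example"))

    // Transformation: lazy, only extends the DAG / lineage graph.
    val doubled = sc.parallelize(1 to 1000).map(_ * 2)

    // Action: triggers DAG creation and submission to the DAG scheduler.
    println(s"count = ${doubled.count()}")

    sc.stop()
  }
}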
Yes, you can access files uploaded via the --files argument.

This is how I'm able to access files passed in via --files:
./bin/spark-submit \
  --class com.MyClass \
  --master yarn-cluster \
  --files /path/to/some/file.ext \
  --jars lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-rdbms-3.2.9.jar,lib/datanucleus-core-3.2.10.jar \
  /path/to/app.jar file.ext
and in my Spark code:
import scala.io.Source

val filename = args(0)
val linecount = Source.fromFile(filename).getLines.size
I believe these files are downloaded into the same directory the jar is placed in, which is why passing just the filename, rather than an absolute path, to Source.fromFile works.
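If you need the contents before creating the Spark context (as in the original question), a minimal driver-side sketch along these lines should work, assuming, as described above, that the file shipped with --files ends up in the driver container's working directory in yarn-cluster mode; the file name and property key below are just placeholders:

import java.io.FileInputStream
import java.util.Properties

object MyDriver {
  def main(args: Array[String]): Unit = {
    // --files some.properties localizes the file next to the application jar,
    // so the bare file name resolves before any SparkContext exists.
    val props = new Properties()
    val in = new FileInputStream("some.properties")
    try props.load(in) finally in.close()

    val someValue = props.getProperty("some.key")  // placeholder key
    // ... build the SparkConf / SparkContext here and use someValue as needed
  }
}

Once a SparkContext exists, org.apache.spark.SparkFiles.get("some.properties") is another commonly suggested way to resolve the localized path of a distributed file.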