
Read files sent with spark-submit by the driver

Tags:

apache-spark

I am sending a Spark job to run on a remote cluster by running

spark-submit ... --deploy-mode cluster --files some.properties ... 

I want to read the contents of the some.properties file in the driver code, i.e. before creating the Spark context and launching RDD tasks. The file is copied to the remote driver, but not to the driver's working directory.

The ways around this problem that I know of are:

  1. Upload the file to HDFS
  2. Store the file in the app jar

Both are inconvenient since this file is frequently changed on the submitting dev machine.

Is there a way to read a file uploaded with the --files flag from the driver's main method?

Little Bobby Tables asked Jan 20 '16 12:01




1 Answer

Yes, you can access files uploaded via the --files argument.

This is how I'm able to access files passed in via --files:

./bin/spark-submit \
  --class com.MyClass \
  --master yarn-cluster \
  --files /path/to/some/file.ext \
  --jars lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-rdbms-3.2.9.jar,lib/datanucleus-core-3.2.10.jar \
  /path/to/app.jar file.ext

and in my Spark code:

import scala.io.Source
val filename = args(0) // file name only, no path
val linecount = Source.fromFile(filename).getLines.size

I believe these files are downloaded into the same working directory where the jar is placed on each node (in yarn-cluster mode the driver itself runs on one of those nodes), which is why passing just the file name, rather than an absolute path, to Source.fromFile works.
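For the properties file from the question, the same idea could be sketched as follows. This is a minimal sketch, not a verified recipe: it assumes, as described above, that the file shipped with --files ends up in the driver's current working directory (which I have only seen in yarn-cluster mode). The DriverProps object, its load helper, and the app.name key are hypothetical names used for illustration only.

```scala
import java.nio.file.{Files, Paths}
import java.util.Properties

// Hypothetical helper: loads a properties file in the driver before any SparkContext exists.
object DriverProps {

  // Resolves fileName against the current working directory (or uses it as-is if absolute).
  def load(fileName: String): Properties = {
    val path = Paths.get(fileName)
    // Fail early with a message that shows where we actually looked.
    require(Files.isReadable(path),
      s"$fileName is not readable from ${Paths.get("").toAbsolutePath}")
    val props = new Properties()
    val in = Files.newInputStream(path)
    try props.load(in) finally in.close()
    props
  }

  def main(args: Array[String]): Unit = {
    // args(0) is the bare file name passed after the jar, e.g. "some.properties"
    val props = load(args(0))
    // "app.name" is a made-up key; substitute whatever your file defines.
    println(props.getProperty("app.name", "<unset>"))
    // ... create the SparkContext afterwards, using values from props ...
  }
}
```

The require call is there because a wrong assumption about the working directory otherwise shows up only as a bare FileNotFoundException; printing the resolved directory makes it easier to see where the driver is actually running.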

Ton Torres answered Sep 21 '22 08:09