
SparkContext.addFile vs spark-submit --files

Tags:

apache-spark

I am using Spark 1.6.0. I want to pass some properties files, such as log4j.properties and a few other custom properties files, to my application. I see that we can use --files, but I also saw that SparkContext has an addFile method. I would prefer to use --files instead of adding the files programmatically, assuming both options are the same?

I did not find much documentation about --files, so are --files and SparkContext.addFile equivalent?

References I found about --files and about SparkContext.addFile.
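For context, this is a minimal sketch of the two options I am comparing (the app name, script name and property file names are just examples):

    # Option 1: let spark-submit distribute the files
    #   spark-submit --files log4j.properties,app.properties my_app.py

    # Option 2: distribute the same files programmatically
    from pyspark import SparkContext

    sc = SparkContext(appName="my_app")    # example app name
    sc.addFile("log4j.properties")         # local paths, resolved on the driver
    sc.addFile("app.properties")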

Abdullah Shaikh asked Aug 10 '16 17:08


People also ask

What does SparkContext.addFile do?

It adds a file to be downloaded with the Spark job on every node. The path passed can be either a local file, a file in HDFS (or another Hadoop-supported file system), or an HTTP, HTTPS or FTP URI.
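A minimal sketch of those path types (all locations below are placeholders):

    from pyspark import SparkContext

    sc = SparkContext(appName="addfile-demo")              # example app name

    # Local file on the driver's file system
    sc.addFile("/tmp/app.properties")

    # File already stored in HDFS (placeholder path)
    sc.addFile("hdfs:///config/app.properties")

    # File served over HTTP (placeholder URL)
    sc.addFile("http://example.com/config/app.properties")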

What is the use of SparkContext in Apache spark?

A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster. Note: Only one SparkContext should be active per JVM.
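A minimal sketch of those uses, with arbitrary example values:

    from pyspark import SparkContext

    sc = SparkContext(master="local[*]", appName="context-demo")  # example settings

    rdd = sc.parallelize([1, 2, 3, 4])        # create an RDD
    counter = sc.accumulator(0)               # create an accumulator
    lookup = sc.broadcast({"a": 1, "b": 2})   # create a broadcast variable

    rdd.foreach(lambda x: counter.add(x))
    print(counter.value)                      # 10
    print(lookup.value["a"])                  # 1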

Why do we use PySpark SparkFiles?

In Apache Spark, you can upload your files using sc.addFile (where sc is your default SparkContext) and get the path on a worker using SparkFiles.get. SparkFiles thus resolves the paths to files added through SparkContext.addFile.

What is SparkFiles in PySpark?

PySpark provides the facility to upload your files using sc.addFile. The path to an uploaded file inside the working directory on each node can then be retrieved with SparkFiles.get.
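A minimal sketch of that pattern, reading a distributed file from inside a task (the file name is just an example):

    from pyspark import SparkContext, SparkFiles

    sc = SparkContext(appName="sparkfiles-demo")   # example app name
    sc.addFile("app.properties")                   # example file shipped to every node

    def first_line(_):
        # SparkFiles.get resolves the file's location on the worker executing the task
        with open(SparkFiles.get("app.properties")) as f:
            return f.readline().strip()

    print(sc.parallelize([0], 1).map(first_line).collect())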


1 Answer

It depends on whether your Spark application is running in client or cluster mode.

In client mode the driver runs locally, on the machine from which you submit the application, so it can access those files directly from your project because they are available on the local file system. SparkContext.addFile will find your local files and work as expected.
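For instance (app name and path are placeholders), something like this works in client mode because the relative path is resolved on your own machine:

    # client mode: the driver runs where spark-submit is launched
    #   spark-submit --deploy-mode client my_app.py    (example command)
    from pyspark import SparkContext

    sc = SparkContext(appName="client-mode-demo")      # example app name

    # Resolved against the local project directory, since the driver runs locally
    sc.addFile("conf/log4j.properties")                # example relative path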

If your application runs in cluster mode, it is submitted via spark-submit, which means your whole application is transferred to the Spark master or to YARN. The driver (which in YARN cluster mode runs inside the application master) is then started on some node of the cluster, in its own environment. That environment has no access to your local project directory, so all necessary files have to be transferred as well. This is what the --files option does. The same concept applies to jar files (the dependencies of your Spark application): in cluster mode they need to be added with the --jars option to be available on the classpath of the driver and executors. If you use PySpark, there is a --py-files option for Python dependencies.
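A sketch of what such a submission could look like in cluster mode (all file, jar and script names are placeholders):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --files log4j.properties,app.properties \
      --jars libs/some-dependency.jar \
      --py-files utils.py \
      my_app.py

Inside the application, the shipped files can then be read via SparkFiles.get, just like files added with SparkContext.addFile.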

gclaussn answered Sep 18 '22 15:09