
SparkContext.addFile vs spark-submit --files

Tags:

apache-spark

I am using Spark 1.6.0. I want to pass some properties files, such as log4j.properties and a few other custom properties files, to my application. I see that we can use --files, but I also saw that SparkContext has an addFile method. I would prefer to use --files instead of adding the files programmatically, assuming both options are the same?

I did not find much documentation about --files, so are --files and SparkContext.addFile equivalent?

References I found about --files and about SparkContext.addFile.
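For context, this is a minimal sketch of the two options I am comparing (the app name, script name and property file names are just examples):

    # Option 1: let spark-submit distribute the files
    #   spark-submit --files log4j.properties,app.properties my_app.py

    # Option 2: distribute the same files programmatically
    from pyspark import SparkContext

    sc = SparkContext(appName="my_app")    # example app name
    sc.addFile("log4j.properties")         # local paths, resolved on the driver
    sc.addFile("app.properties")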

Abdullah Shaikh asked Aug 10 '16 17:08


People also ask

What does SparkContext.addFile do?

It adds a file to be downloaded with the Spark job on every node. The path passed can be either a local file, a file in HDFS (or another Hadoop-supported file system), or an HTTP, HTTPS or FTP URI.
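A minimal sketch of those path types (all locations below are placeholders):

    from pyspark import SparkContext

    sc = SparkContext(appName="addfile-demo")              # example app name

    # Local file on the driver's file system
    sc.addFile("/tmp/app.properties")

    # File already stored in HDFS (placeholder path)
    sc.addFile("hdfs:///config/app.properties")

    # File served over HTTP (placeholder URL)
    sc.addFile("http://example.com/config/app.properties")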

What is the use of SparkContext in Apache spark?

A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster. Note: Only one SparkContext should be active per JVM.
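A minimal sketch of those uses, with arbitrary example values:

    from pyspark import SparkContext

    sc = SparkContext(master="local[*]", appName="context-demo")  # example settings

    rdd = sc.parallelize([1, 2, 3, 4])        # create an RDD
    counter = sc.accumulator(0)               # create an accumulator
    lookup = sc.broadcast({"a": 1, "b": 2})   # create a broadcast variable

    rdd.foreach(lambda x: counter.add(x))
    print(counter.value)                      # 10
    print(lookup.value["a"])                  # 1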

Why do we use PySpark SparkFiles?

In Apache Spark, you can upload your files using sc.addFile (where sc is your default SparkContext) and get the path on a worker using SparkFiles.get. SparkFiles thus resolves the paths to files added through SparkContext.addFile.

What is SparkFiles in PySpark?

PySpark provides the facility to upload your files using sc.addFile. The path to an uploaded file inside the working directory on each node can then be retrieved with SparkFiles.get.
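A minimal sketch of that pattern, reading a distributed file from inside a task (the file name is just an example):

    from pyspark import SparkContext, SparkFiles

    sc = SparkContext(appName="sparkfiles-demo")   # example app name
    sc.addFile("app.properties")                   # example file shipped to every node

    def first_line(_):
        # SparkFiles.get resolves the file's location on the worker executing the task
        with open(SparkFiles.get("app.properties")) as f:
            return f.readline().strip()

    print(sc.parallelize([0], 1).map(first_line).collect())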


1 Answer

It depends on whether your Spark application is running in client or cluster mode.

In client mode the driver runs locally, on the machine from which you submit the application, so it can access those files directly from your project because they are available on the local file system. SparkContext.addFile will find your local files and work as expected.
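For instance (app name and path are placeholders), something like this works in client mode because the relative path is resolved on your own machine:

    # client mode: the driver runs where spark-submit is launched
    #   spark-submit --deploy-mode client my_app.py    (example command)
    from pyspark import SparkContext

    sc = SparkContext(appName="client-mode-demo")      # example app name

    # Resolved against the local project directory, since the driver runs locally
    sc.addFile("conf/log4j.properties")                # example relative path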

If your application runs in cluster mode, it is submitted via spark-submit, which means your whole application is transferred to the Spark master or to YARN. The driver (which in YARN cluster mode runs inside the application master) is then started on some node of the cluster, in its own environment. That environment has no access to your local project directory, so all necessary files have to be transferred as well. This is what the --files option does. The same concept applies to jar files (the dependencies of your Spark application): in cluster mode they need to be added with the --jars option to be available on the classpath of the driver and executors. If you use PySpark, there is a --py-files option for Python dependencies.
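A sketch of what such a submission could look like in cluster mode (all file, jar and script names are placeholders):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --files log4j.properties,app.properties \
      --jars libs/some-dependency.jar \
      --py-files utils.py \
      my_app.py

Inside the application, the shipped files can then be read via SparkFiles.get, just like files added with SparkContext.addFile.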

gclaussn answered Sep 18 '22 15:09