I am using Spark 1.6.0. I want to pass some properties files, like log4j.properties and some other custom properties files. I see that we can use --files, but I also saw that there is a method addFile on SparkContext. I would prefer to use --files instead of adding the files programmatically, assuming both options behave the same.
I did not find much documentation about --files, so are --files and SparkContext.addFile the same option?
References I found about --files and SparkContext.addFile:
addFile. Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster. Note: Only one SparkContext should be active per JVM.
In Apache Spark, you can upload your files using sc.addFile (sc is your default SparkContext) and get the path on a worker using SparkFiles.get. Thus, SparkFiles resolves the paths to files added through SparkContext.
PySpark provides the facility to upload your files using sc.addFile. We can also get the path of a file in the working directory using SparkFiles.get.
It depends on whether your Spark application is running in client or cluster mode.
In client mode the driver runs locally and can access those files from your project, because they are available within the local file system. SparkContext.addFile will find your local files and work as expected.
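For illustration, here is a minimal client-mode sketch, assuming a hypothetical file /local/path/custom.properties: the file is distributed with sc.addFile and each task resolves its node-local copy by name with SparkFiles.get.

    import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

    object AddFileExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("addFile-example"))

        // Distribute a local file to every node that runs tasks for this job
        // (hypothetical path).
        sc.addFile("/local/path/custom.properties")

        // Inside a task, resolve the node-local copy by its file name.
        sc.parallelize(1 to 4, 2).foreach { _ =>
          val localPath = SparkFiles.get("custom.properties")
          val props = new java.util.Properties()
          props.load(new java.io.FileInputStream(localPath))
          // ... use props inside the task ...
        }

        sc.stop()
      }
    }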
If your application is running in cluster mode, it is submitted via spark-submit and the whole application is transferred to the Spark master or YARN, which starts the driver (application master) within the cluster on a specific node, in a separate environment. That environment has no access to your local project directory, so all necessary files have to be transferred as well. This can be achieved with the --files option. The same concept applies to jar files (the dependencies of your Spark application): in cluster mode they need to be added with the --jars option to be available on the classpath of the application master. If you use PySpark, there is a --py-files option.
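For comparison, here is a minimal cluster-mode sketch, assuming a hypothetical app.properties and my-app.jar. Since --files registers files through the same distribution mechanism as addFile, SparkFiles.get should still resolve them by name inside the application.

    // Hypothetical submit command (YARN cluster mode):
    //   spark-submit --master yarn --deploy-mode cluster \
    //     --files /local/path/log4j.properties,/local/path/app.properties \
    //     --jars /local/path/some-dependency.jar \
    //     --class com.example.MyApp my-app.jar
    import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

    object MyApp {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("files-example"))

        // Files passed with --files are registered on the SparkContext,
        // so they can be resolved by file name on the driver and executors.
        val propsPath = SparkFiles.get("app.properties")
        val props = new java.util.Properties()
        props.load(new java.io.FileInputStream(propsPath))

        // ... use props to configure the job ...
        sc.stop()
      }
    }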