What's the difference between --archives, --files, py-files in pyspark job arguments

--archives, --files, --py-files, sc.addFile and sc.addPyFile are quite confusing. Can someone explain these clearly?

asked Jun 28 '16 by JasonWayne


People also ask

How do I run a .py file in PySpark?

Run a PySpark application by passing the .py file you want to run to spark-submit; any dependencies can be supplied as .py, .egg, or .zip files with the --py-files option.

What is the difference between spark-submit and pyspark?

pyspark is a REPL for Python, similar to spark-shell; spark-submit is used to submit a Spark application to a cluster.


1 Answer

These options are truly scattered all over the place.

In general, add your data files via --files or --archives and your code files via --py-files. The latter are added to the import path (cf. here), so you can import and use them.
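
For instance, a submission might look like the sketch below; every file name here is made up purely for illustration:

    # Illustrative submission (hypothetical file names):
    #   spark-submit \
    #     --files config.json \
    #     --archives data.zip#data \
    #     --py-files helpers.zip \
    #     main.py

    from pyspark import SparkContext

    sc = SparkContext(appName="submit-demo")

    # Because helpers.zip was shipped via --py-files, its modules are importable
    # on the driver and inside tasks; config.json and data.zip travel as data only.
    import helpers                                          # hypothetical module in helpers.zip

    rdd = sc.parallelize([1, 2, 3]).map(helpers.transform)  # hypothetical function
    print(rdd.collect())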

As you can imagine, these CLI arguments are actually handled by the addFile and addPyFile functions (cf. here).
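
The programmatic route looks roughly like this (paths, module and function names are hypothetical): addPyFile makes the shipped code importable in later tasks, while addFile only distributes the file, which tasks then locate with SparkFiles.get:

    from pyspark import SparkContext, SparkFiles

    sc = SparkContext(appName="addfile-demo")

    # Roughly what --files does: copy a data file to every node.
    sc.addFile("hdfs:///tmp/lookup.csv")        # hypothetical path

    # Roughly what --py-files does: ship code and put it on the import path.
    sc.addPyFile("hdfs:///tmp/helpers.zip")     # hypothetical path

    def task(x):
        import helpers                          # importable thanks to addPyFile
        path = SparkFiles.get("lookup.csv")     # local copy of the addFile'd file
        return helpers.lookup(x, path)          # hypothetical function

    print(sc.parallelize(range(3)).map(task).collect())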

  • From http://spark.apache.org/docs/latest/programming-guide.html

Behind the scenes, pyspark invokes the more general spark-submit script.

You can add Python .zip, .egg or .py files to the runtime path by passing a comma-separated list to --py-files

  • From http://spark.apache.org/docs/latest/running-on-yarn.html

The --files and --archives options support specifying file names with the # similar to Hadoop. For example you can specify: --files localtest.txt#appSees.txt and this will upload the file you have locally named localtest.txt into HDFS but this will be linked to by the name appSees.txt, and your application should use the name as appSees.txt to reference it when running on YARN.

  • From http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=addpyfile#pyspark.SparkContext.addPyFile

addFile(path) Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.

addPyFile(path) Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
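
Tying the excerpts above together, here is a minimal sketch of the # alias in action; it assumes YARN (as the quoted docs do) and uses made-up file names. On YARN the --files entry is exposed in each container's working directory under its alias, so tasks open it by that name:

    # Illustrative submission:
    #   spark-submit --master yarn --files localtest.txt#appSees.txt main.py

    from pyspark import SparkContext

    sc = SparkContext(appName="alias-demo")

    def first_line(_):
        # The file uploaded as localtest.txt is linked into the container's
        # working directory as appSees.txt, so the task reads it by that name.
        with open("appSees.txt") as f:
            return f.readline()

    print(sc.parallelize([0]).map(first_line).collect())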

answered Sep 29 '22 by shuaiyuancn