What's the difference between --archives, --files, py-files in pyspark job arguments

--archives, --files, --py-files, sc.addFile and sc.addPyFile are quite confusing. Can someone explain these clearly?

asked Jun 28 '16 by JasonWayne


People also ask

How do I run a .py file in PySpark?

Run a PySpark application by passing the .py file you want to run to spark-submit; any dependencies can be supplied as .py, .egg, or .zip files with the --py-files option.

What is the difference between spark-submit and pyspark?

pyspark is a REPL for Python, similar to spark-shell; spark-submit is used to submit a Spark application to a cluster.


1 Answer

These options are truly scattered all over the place.

In general, add your data files via --files or --archives and your code files via --py-files. The latter are added to the import path (cf. here), so you can import and use them.
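
For instance, a submission might look like the sketch below; every file name here is made up purely for illustration:

    # Illustrative submission (hypothetical file names):
    #   spark-submit \
    #     --files config.json \
    #     --archives data.zip#data \
    #     --py-files helpers.zip \
    #     main.py

    from pyspark import SparkContext

    sc = SparkContext(appName="submit-demo")

    # Because helpers.zip was shipped via --py-files, its modules are importable
    # on the driver and inside tasks; config.json and data.zip travel as data only.
    import helpers                                          # hypothetical module in helpers.zip

    rdd = sc.parallelize([1, 2, 3]).map(helpers.transform)  # hypothetical function
    print(rdd.collect())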

As you can imagine, these CLI arguments are actually handled by the addFile and addPyFile functions (cf. here).
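
The programmatic route looks roughly like this (paths, module and function names are hypothetical): addPyFile makes the shipped code importable in later tasks, while addFile only distributes the file, which tasks then locate with SparkFiles.get:

    from pyspark import SparkContext, SparkFiles

    sc = SparkContext(appName="addfile-demo")

    # Roughly what --files does: copy a data file to every node.
    sc.addFile("hdfs:///tmp/lookup.csv")        # hypothetical path

    # Roughly what --py-files does: ship code and put it on the import path.
    sc.addPyFile("hdfs:///tmp/helpers.zip")     # hypothetical path

    def task(x):
        import helpers                          # importable thanks to addPyFile
        path = SparkFiles.get("lookup.csv")     # local copy of the addFile'd file
        return helpers.lookup(x, path)          # hypothetical function

    print(sc.parallelize(range(3)).map(task).collect())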

  • From http://spark.apache.org/docs/latest/programming-guide.html

Behind the scenes, pyspark invokes the more general spark-submit script.

You can add Python .zip, .egg or .py files to the runtime path by passing a comma-separated list to --py-files

  • From http://spark.apache.org/docs/latest/running-on-yarn.html

The --files and --archives options support specifying file names with the # similar to Hadoop. For example you can specify: --files localtest.txt#appSees.txt and this will upload the file you have locally named localtest.txt into HDFS but this will be linked to by the name appSees.txt, and your application should use the name as appSees.txt to reference it when running on YARN.

  • From http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=addpyfile#pyspark.SparkContext.addPyFile

addFile(path) Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.

addPyFile(path) Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
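
Tying the excerpts above together, here is a minimal sketch of the # alias in action; it assumes YARN (as the quoted docs do) and uses made-up file names. On YARN the --files entry is exposed in each container's working directory under its alias, so tasks open it by that name:

    # Illustrative submission:
    #   spark-submit --master yarn --files localtest.txt#appSees.txt main.py

    from pyspark import SparkContext

    sc = SparkContext(appName="alias-demo")

    def first_line(_):
        # The file uploaded as localtest.txt is linked into the container's
        # working directory as appSees.txt, so the task reads it by that name.
        with open("appSees.txt") as f:
            return f.readline()

    print(sc.parallelize([0]).map(first_line).collect())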

answered Sep 29 '22 by shuaiyuancn