Running Spark jobs on a YARN cluster with additional files

Tags: apache-spark, hadoop-yarn, hdfs

I'm writing a simple Spark application that takes an input RDD, sends it to an external script via pipe, and writes the output of that script to a file. The driver code looks like this:

val input = args(0)
val scriptPath = args(1)
val output = args(2)
val sc = getSparkContext
if (args.length == 4) {
  //Here I pass an additional argument which contains an absolute path to a script on my local machine, only for local testing
  sc.addFile(args(3))
}

sc.textFile(input).pipe(Seq("python2", SparkFiles.get(scriptPath))).saveAsTextFile(output)

When I run it on my local machine it works fine. But when I submit it to a YARN cluster via

spark-submit --master yarn --deploy-mode cluster --files /absolute/path/to/local/test.py --class somepackage.PythonLauncher path/to/driver.jar path/to/input/part-* test.py path/to/output

it fails with an exception.

Lost task 1.0 in stage 0.0 (TID 1, rwds2.1dmp.ru): java.lang.Exception: Subprocess exited with status 2

I've tried different variations of the pipe command. For instance, .pipe("cat") works fine and behaves as expected, but .pipe(Seq("cat", scriptPath)) also fails, this time with error code 1, so it seems that Spark can't resolve the path to the script on a cluster node.

Any suggestions?

asked May 05 '15 08:05

Alexander Tokarev

2 Answers

I don't use Python myself, but I found some clues that may be useful to you (in the Spark 1.3 source code, SparkSubmitArguments):

--py-files PY_FILES, Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.
--files FILES, Comma-separated list of files to be placed in the working directory of each executor.
--archives ARCHIVES, Comma separated list of archives to be extracted into the working directory of each executor.

And also, your arguments to spark-submit should follow this style:

Usage: spark-submit [options] <app jar | python file> [app arguments]
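
As a rough illustration of the --files description above: if the script is shipped with --files, it should end up in each executor's working directory, so the piped command can refer to it by a relative name rather than by a path computed on the driver. The sketch below assumes that behaviour on YARN. The object name `PipeByRelativeName` and the helper `pipeCommand` are hypothetical, and the context is created directly instead of through the question's `getSparkContext`.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver; only the pipe call differs from the question's code.
object PipeByRelativeName {
  // Hypothetical helper: build the command relative to the executor's
  // working directory, where --files is assumed to place the script.
  def pipeCommand(scriptName: String): Seq[String] =
    Seq("python2", s"./$scriptName")

  def main(args: Array[String]): Unit = {
    val input = args(0)
    val scriptName = args(1) // for example "test.py", the name given to --files
    val output = args(2)

    // The master and deploy mode are expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("PipeByRelativeName"))
    try {
      sc.textFile(input)
        .pipe(pipeCommand(scriptName))
        .saveAsTextFile(output)
    } finally {
      sc.stop()
    }
  }
}
```

With this variant the spark-submit command from the question would stay the same, apart from the class name; only the way the script path is built changes.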

answered Nov 07 '22 05:11

yjshen

To understand why, you need to be familiar with the differences between Spark's three running modes: standalone, yarn-client, and yarn-cluster.

In standalone and yarn-client mode, the driver program runs in the current location on your local machine, while the worker program runs somewhere else (in standalone mode perhaps in a temp directory under $SPARK_HOME, in yarn-client mode perhaps on a random node in the cluster). So you can access a local file by its local path in the driver program, but not in the worker program.

In yarn-cluster mode, however, both the driver and the worker programs run on random cluster nodes, and local file paths are resolved relative to the machine and working directory each one runs in, so a file-not-found error is thrown. You need to ship these files with either --files or --archives when submitting, package them into an .egg or .jar yourself before submitting, or call the addFile API in your driver program.
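
A minimal sketch of the addFile route, under some assumptions: the script is given as a URI that the cluster nodes can read (for example a file already uploaded to HDFS), and its path is resolved with `SparkFiles.get` inside the task, so it refers to the executor's copy rather than the driver's. The object name `AddFileLauncher` and the helper `baseName` are hypothetical. For brevity the example buffers each partition in memory before handing it to the script, which `pipe` itself does not do.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets

import scala.sys.process.Process

import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

// Hypothetical driver illustrating addFile plus executor-side path lookup.
object AddFileLauncher {
  // Hypothetical helper: files added with addFile are looked up by file name.
  def baseName(uri: String): String =
    uri.substring(uri.lastIndexOf('/') + 1)

  def main(args: Array[String]): Unit = {
    val input = args(0)
    val scriptUri = args(1) // assumed to be readable from every node
    val output = args(2)
    val scriptName = baseName(scriptUri)

    val sc = new SparkContext(new SparkConf().setAppName("AddFileLauncher"))
    try {
      sc.addFile(scriptUri)
      sc.textFile(input).mapPartitions { lines =>
        // Evaluated inside the task, so this is the executor's local copy.
        val script = SparkFiles.get(scriptName)
        // Simplification: the whole partition is held in memory as stdin.
        val stdin = new ByteArrayInputStream(
          lines.mkString("\n").getBytes(StandardCharsets.UTF_8))
        // !! returns the script's stdout and throws on a non-zero exit status.
        val out = Process(Seq("python2", script)).#<(stdin).!!
        if (out.isEmpty) Iterator.empty else out.split("\n").iterator
      }.saveAsTextFile(output)
    } finally {
      sc.stop()
    }
  }
}
```

The difference from the question's code is where the path is computed: there `SparkFiles.get` runs on the driver while the command is being built, whereas here it runs on the node that actually executes the script.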

answered Nov 07 '22 06:11

kuixiong
