How does Spark on Yarn store shuffled files?

Tags:

apache-spark

I'm performing a filter in Spark on YARN and receiving the error below. Any help is appreciated, but my main question is why the file is not found.

/hdata/10/yarn/nm/usercache/spettinato/appcache/application_1428497227446_131967/spark-local-20150708124954-aa00/05/merged_shuffle_1_343_1

It appears that Spark can't find a file that has been stored to HDFS after being shuffled.

Why is Spark accessing the directory "/hdata/"? It does not exist in HDFS, so is it supposed to be a local directory or an HDFS directory?
Can I configure the location where shuffled data is stored?

15/07/08 12:57:03 WARN TaskSetManager: Loss was due to java.io.FileNotFoundException
java.io.FileNotFoundException: /hdata/10/yarn/nm/usercache/spettinato/appcache/application_1428497227446_131967/spark-local-20150708124954-aa00/05/merged_shuffle_1_343_1 (No such file or directory)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
        at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:116)
        at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:177)
        at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161)
        at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:158)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.Task.run(Task.scala:51)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

EDIT: I figured out some of this. The directory configured by spark.local.dir is the local scratch directory Spark uses for map output (shuffle) files and for RDDs that get stored on disk, per http://spark.apache.org/docs/latest/configuration.html
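To sketch the configuration side of this (property names are from the Spark and Hadoop documentation; the directory paths below are made-up examples): outside of YARN, shuffle files land under spark.local.dir, but when running on YARN that setting is overridden for executors by the NodeManager's local directories, which is consistent with the /hdata/.../yarn/nm/usercache/... path in the error above. So the location is configured on the YARN side:

```xml
<!-- yarn-site.xml (example paths, mirroring the /hdata layout in the error):
     on YARN, executors write their shuffle/scratch files under these
     NodeManager local dirs, and spark.local.dir is ignored for them. -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/hdata/1/yarn/nm,/hdata/2/yarn/nm</value>
</property>
```

These are plain local filesystem paths on each node, not HDFS paths, which would answer the first question above: /hdata/ is local disk on the worker, not a directory in HDFS.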

asked Jul 08 '15 by pettinato


2 Answers

I suggest checking the space left on your system. Like Carlos, I'd say the task died, and the reason is that Spark could not write a shuffle file due to lack of disk space.

Try grepping for "java.io.IOException: No space left on device" in the ./work directory of your workers.
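A minimal sketch of that check, simulated here on a sample log line (on a real cluster you would run the grep against each worker's actual ./work directory, and the df path would be your own local dirs):

```shell
# Simulate a worker log containing the disk-full exception
mkdir -p work
printf '%s\n' \
  '15/07/08 12:57:03 ERROR Executor: java.io.IOException: No space left on device' \
  > work/stderr.log

# The actual check: recursively list files under ./work that mention disk-full errors
grep -Rl "No space left on device" work

# Also worth checking free space on the shuffle directories themselves, e.g.:
# df -h /hdata/*/yarn/nm
```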

answered Sep 27 '22 by Bacon


The most likely explanation is that the task died, for example from an OutOfMemoryError or some other exception.

answered Sep 27 '22 by Carlos Rendon