
What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

I'm deploying a Spark data processing job on an EC2 cluster. The job is small for the cluster (16 cores and 120 GB of RAM in total): the largest RDD has only 76k+ rows, but it is heavily skewed in the middle (hence the need for repartitioning), and each row holds around 100 KB of data after serialization. The job always gets stuck during repartitioning; namely, it repeatedly hits the following errors and retries:

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle

org.apache.spark.shuffle.FetchFailedException: Error in opening FileSegmentManagedBuffer

org.apache.spark.shuffle.FetchFailedException: java.io.FileNotFoundException: /tmp/spark-...

I've tried to identify the problem, but both memory and disk consumption on the machines throwing these errors appear to be below 50%. I've also tried different configurations (sketched as configuration settings below), including:

letting driver/executor memory use 60% of total memory
letting Netty prioritize the JVM shuffle buffer
increasing the shuffle streaming buffer to 128m
using KryoSerializer and maxing out all its buffers
increasing the shuffle memoryFraction to 0.4
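
For reference, here is a rough sketch of how those attempts might be expressed as Spark 1.x configuration settings; the exact keys and values used in the job are not shown in the question, so treat every key and number below as an assumption:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical sketch of the tuning attempts described above; the exact
    // keys/values used in the original job are assumptions, and some key names
    // vary slightly across Spark 1.x versions.
    val conf = new SparkConf()
      .setAppName("skewed-repartition-job")
      .set("spark.executor.memory", "4g")                 // ~60% of per-node memory (assumed)
      .set("spark.shuffle.io.preferDirectBufs", "false")  // let Netty use on-heap (JVM) buffers
      .set("spark.reducer.maxSizeInFlight", "128m")       // shuffle fetch/streaming buffer (assumed)
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.max", "512m")
      .set("spark.shuffle.memoryFraction", "0.4")         // pre-Spark-1.6 shuffle memory fraction
    val sc = new SparkContext(conf)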

But none of them worked. The small job always triggers the same series of errors and maxes out its retries (up to 1000 times). How can I troubleshoot this in such a situation?

Thanks a lot if you have any clue.

asked Apr 24 '15 by tribbloid

People also ask

What are the operations that can cause a shuffle in spark?

Transformations that can cause a shuffle include repartition operations like repartition and coalesce, 'ByKey operations (except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join.
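
As an illustration (not part of the original answer), a minimal Scala sketch of transformations that trigger a shuffle:

    import org.apache.spark.{SparkConf, SparkContext}

    object ShuffleExamples {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("shuffle-examples").setMaster("local[2]"))
        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

        // Repartition operations: rows move to a new set of partitions (full shuffle).
        val repartitioned = pairs.repartition(8)

        // *ByKey operations: rows with the same key must be co-located on one partition.
        val summed = pairs.reduceByKey(_ + _)

        // Join operations: matching keys from both RDDs must meet on the same partition.
        val other = sc.parallelize(Seq(("a", "x"), ("b", "y")))
        val joined = pairs.join(other)

        joined.collect().foreach(println)
        sc.stop()
      }
    }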

Where is shuffle data stored in spark?

Shuffle data is the intermediate output of the map side. By default, Spark keeps this intermediate output in memory, but if there is not enough space it spills the data to local disk.
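
As a hedged illustration (the directory paths are made up), the local scratch directories used for spilled shuffle files can be pointed somewhere with more space via spark.local.dir; on YARN the node manager's local directories are used instead:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: shuffle files that spill to disk live under the executors'
    // local scratch directories (the /tmp/spark-... paths in the question).
    // The paths below are examples; on YARN, the node manager's local dirs win.
    val conf = new SparkConf()
      .setAppName("local-dir-example")
      .set("spark.local.dir", "/mnt/spark-scratch,/mnt2/spark-scratch")
    val sc = new SparkContext(conf)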

How do I stop spark shuffling data?

One way to avoid shuffles when joining two datasets is to take advantage of broadcast variables. When one of the datasets is small enough to fit in memory in a single executor, it can be loaded into a hash table on the driver and then broadcast to every executor.
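
A minimal Scala sketch of that pattern with the RDD API (all names and data are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    object BroadcastJoinSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("broadcast-join").setMaster("local[2]"))

        // Large dataset: (userId, event)
        val events = sc.parallelize(Seq((1, "click"), (2, "view"), (1, "purchase")))

        // Small dataset that fits in memory on the driver: userId -> userName
        val users = Map(1 -> "alice", 2 -> "bob")

        // Broadcast the small table to every executor; the "join" becomes a
        // map-side hash lookup, so no shuffle is needed.
        val usersBc = sc.broadcast(users)
        val joined = events.flatMap { case (userId, event) =>
          usersBc.value.get(userId).map(name => (userId, name, event))
        }

        joined.collect().foreach(println)
        sc.stop()
      }
    }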

What is shuffling in the context of spark?

Shuffling is the process of exchanging data between partitions. As a result, data rows can move between worker nodes when their source partition and the target partition reside on different machines. Spark doesn't move data between nodes randomly.


2 Answers

Check your logs for an error similar to this:

ERROR 2015-05-12 17:29:16,984 Logging.scala:75 - Lost executor 13 on node-xzy: remote Akka client disassociated

Every time you get this error, it is because you lost an executor. As for why you lost the executor, that is another story; again, check your logs for clues.

For one thing, YARN can kill your job if it thinks you are using "too much memory".

Check for something like this:

org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl  - Container [<edited>] is running beyond physical memory limits. Current usage: 18.0 GB of 18 GB physical memory used; 19.4 GB of 37.8 GB virtual memory used. Killing container.

Also see: http://apache-spark-developers-list.1001551.n3.nabble.com/Lost-executor-on-YARN-ALS-iterations-td7916.html

The current state of the art is to increase spark.yarn.executor.memoryOverhead until the job stops failing. We do have plans to try to automatically scale this based on the amount of memory requested, but it will still just be a heuristic.
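
A hedged sketch of raising that setting (the values are illustrative, not from the answer; on Spark 2.3+ the key was renamed spark.executor.memoryOverhead):

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative values only: raise the off-heap overhead that YARN accounts
    // for per executor so the container limit (executor memory + overhead) is
    // not exceeded, and keep increasing it until containers stop being killed.
    val conf = new SparkConf()
      .setAppName("memory-overhead-example")
      .set("spark.executor.memory", "16g")
      .set("spark.yarn.executor.memoryOverhead", "2048") // MB
    val sc = new SparkContext(conf)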

answered Oct 19 '22 by LabOctoCat

I was also getting the error

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle

and looking further in the logs I found

Container killed on request. Exit code is 143

After searching for that exit code, I realized it is mainly related to memory allocation. So I checked the amount of memory I had configured for the executors and found that, by mistake, I had given 7g to the driver and only 1g to each executor. After increasing the executor memory, my Spark job ran successfully.
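
Not the answerer's exact configuration, but a sketch of that kind of fix, shifting memory from the driver to the executors (values are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative split: the executors do the shuffle work, so they get the
    // bulk of the memory. Note that spark.driver.memory normally has to be set
    // before the driver JVM starts (spark-submit --driver-memory or
    // spark-defaults.conf); it appears here only to show the intended split.
    val conf = new SparkConf()
      .setAppName("executor-memory-fix")
      .set("spark.driver.memory", "2g")
      .set("spark.executor.memory", "6g")
    val sc = new SparkContext(conf)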

answered Oct 19 '22 by Dharmendra Chouhan