I sometimes see the following error message when running Spark jobs:
13/10/21 21:27:35 INFO cluster.ClusterTaskSetManager: Loss was due to spark.SparkException: File ./someJar.jar exists and does not match contents of ...
What does this mean? How do I diagnose and fix this?
After digging around in the logs I found "no space left on device" exceptions too. When I then ran df -h and df -i on every node, I found a partition that was full. Interestingly, this partition does not appear to be used for data, but for temporarily storing jars. Its name was something like /var/run or /run.
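(In case it helps anyone else diagnose the same thing, here is a rough sketch of the kind of check to run on every node; the slaves file path assumes the standard spark-ec2 layout, and the loop over workers is illustrative.)
# Check disk space and inode usage on the current node
df -h
df -i
# Repeat on every worker listed in the slaves file (path assumes a spark-ec2 layout)
for SLAVE in $(cat /root/spark/conf/slaves); do
  echo "== $SLAVE =="
  ssh "$SLAVE" 'df -h; df -i'
done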
The solution was to clean the old files off that partition and to set up some automated cleaning. I think setting spark.cleaner.ttl to, say, a day (86400 seconds) should prevent it from happening again.
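For reference, a minimal sketch of two ways spark.cleaner.ttl could be set; the spark-defaults.conf path assumes a spark-ec2 layout, and the job class and jar names are just placeholders.
# Persist the setting cluster-wide (path assumes /root/spark/conf from spark-ec2)
echo "spark.cleaner.ttl 86400" >> /root/spark/conf/spark-defaults.conf
# Or pass it for a single job at submit time (class and jar names are placeholders)
spark-submit --conf spark.cleaner.ttl=86400 --class com.example.MyJob myJob.jar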
Running on AWS EC2 I periodically encounter disk space issues, even after setting spark.cleaner.ttl to a few hours (we iterate quickly). I decided to solve them by moving the /root/spark/work directory onto the ephemeral disk mounted on the instance (I'm using r3.large instances, which have a 32GB ephemeral disk at /mnt):
# Stop the cluster, relocate each worker's work directory onto /mnt, then restart
readonly HOST=some-ec2-hostname-here
ssh -t root@$HOST spark/sbin/stop-all.sh
# On every slave: remove the old work dir and replace it with a symlink to /mnt/work
ssh -t root@$HOST "for SLAVE in \$(cat /root/spark/conf/slaves) ; do ssh \$SLAVE 'rm -rf /root/spark/work && mkdir /mnt/work && ln -s /mnt/work /root/spark/work' ; done"
ssh -t root@$HOST spark/sbin/start-all.sh
As far as I can tell, as of Spark 1.5 the work directory still does not use the mounted storage by default. I haven't tinkered with the deployment settings enough to see whether this is even configurable.
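(For anyone who wants to try a config route instead of the symlink: the standalone deployment docs describe a SPARK_WORKER_DIR setting in conf/spark-env.sh for worker scratch space and job output logs. A minimal sketch, assuming the same /mnt disk, the spark-ec2 layout, and that spark-env.sh is synced to every worker:)
# Point each worker's scratch/work directory at the ephemeral disk (standalone mode)
echo 'export SPARK_WORKER_DIR=/mnt/work' >> /root/spark/conf/spark-env.sh
# Restart the workers so the new setting takes effect
/root/spark/sbin/stop-all.sh && /root/spark/sbin/start-all.sh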