While running a Spark job on a Kubernetes cluster, we get the following error:
2018-11-30 14:00:47 INFO DAGScheduler:54 - Resubmitted ShuffleMapTask(1, 58), so marking it as still running.
2018-11-30 14:00:47 WARN TaskSetManager:66 - Lost task 310.0 in stage 1.0 (TID 311, 10.233.71.29, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason:
The executor with id 3 exited with exit code -1.
The API gave the following brief reason: Evicted
The API gave the following message: The node was low on resource: ephemeral-storage. Container executor was using 515228Ki, which exceeds its request of 0.
The API gave the following container statuses:
How can we configure the job to increase the ephemeral storage size of each container?
We use Spark 2.4.0 and Kubernetes 1.12.1.
The spark-submit options are as follows:
--conf spark.local.dir=/mnt/tmp \
--conf spark.executor.instances=4 \
--conf spark.executor.cores=8 \
--conf spark.executor.memory=100g \
--conf spark.driver.memory=4g \
--conf spark.driver.cores=1 \
--conf spark.kubernetes.memoryOverheadFactor=0.1 \
--conf spark.kubernetes.container.image=spark:2.4.0 \
--conf spark.kubernetes.namespace=visionlab \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.myvolume.options.claimName=pvc \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.myvolume.mount.path=/mnt/ \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.myvolume.mount.readOnly=false \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.myvolume.options.claimName=pvc \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.myvolume.mount.path=/mnt/ \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.myvolume.mount.readOnly=false
Cluster administrators can manage ephemeral storage within a namespace by setting resource quotas that cap the total ephemeral-storage requests and limits across all pods in a non-terminal state. Developers can also set requests and limits on this compute resource at the pod and container level.
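For example, a minimal sketch of such a quota; the name and the size values are placeholders, not taken from the question:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: ephemeral-storage-quota        # hypothetical name
spec:
  hard:
    requests.ephemeral-storage: 50Gi   # cap on total ephemeral storage requested in the namespace
    limits.ephemeral-storage: 100Gi    # cap on total ephemeral storage limits in the namespace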
For example, a Pod can request a total of 10GiB (8GiB + 2GiB) of local ephemeral storage across its containers and enforce a total limit of 12GiB, while also setting an emptyDir sizeLimit of 5GiB (see the sketch below). These settings affect both how the scheduler places the Pod and when the kubelet evicts it.
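A minimal Pod spec matching those numbers might look like the following sketch; the Pod name, container names, and images are hypothetical and only illustrate where the requests, limits, and sizeLimit go:

apiVersion: v1
kind: Pod
metadata:
  name: ephemeral-storage-demo             # hypothetical name
spec:
  containers:
  - name: app                              # hypothetical container
    image: registry.example/app:latest     # placeholder image
    resources:
      requests:
        ephemeral-storage: 8Gi             # 8GiB requested
      limits:
        ephemeral-storage: 8Gi
    volumeMounts:
    - name: scratch
      mountPath: /tmp/scratch
  - name: sidecar                          # hypothetical container
    image: registry.example/sidecar:latest # placeholder image
    resources:
      requests:
        ephemeral-storage: 2Gi             # 2GiB requested, 10GiB total for the Pod
      limits:
        ephemeral-storage: 4Gi             # 12GiB total limit for the Pod
    volumeMounts:
    - name: scratch
      mountPath: /tmp/scratch
  volumes:
  - name: scratch
    emptyDir:
      sizeLimit: 5Gi                       # emptyDir capped at 5GiB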
Kubernetes supports two volume types — persistent and ephemeral — for different use cases. While persistent volumes retain data irrespective of a pod's lifecycle, ephemeral volumes last only for the lifetime of a pod and are deleted as soon as the pod terminates.
As @Rico says, as of Spark 2.4.3 there is no way to set ephemeral storage limits via driver configuration. Instead, you can set default ephemeral storage requests and limits for all new pods in your namespace using a LimitRange:
apiVersion: v1
kind: LimitRange
metadata:
  name: ephemeral-storage-limit-range
spec:
  limits:
  - default:
      ephemeral-storage: 8Gi
    defaultRequest:
      ephemeral-storage: 1Gi
    type: Container
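Assuming the manifest is saved as limitrange.yaml (a placeholder file name), apply it to the namespace the job runs in (visionlab in the question):

kubectl apply -f limitrange.yaml -n visionlab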
This applies the defaults to executors created in the LimitRange's namespace:
$ kubectl get pod spark-kub-1558558662866-exec-67 -o json | jq '.spec.containers[0].resources.requests."ephemeral-storage"'
"1Gi"
It's a little heavy-handed because it applies the default to all containers in your namespace, but it may be a solution if your workload is uniform.