
Spark driver pod getting killed with 'OOMKilled' status

We are running a Spark Streaming application on a Kubernetes cluster using Spark 2.4.5. The application receives a large volume of data through a Kafka topic (one message every 3 ms). Four executors and four Kafka partitions are being used.

While running, the memory of the driver pod keeps increasing until it is killed by K8s with an 'OOMKilled' status. The executors' memory does not show any issues.

When checking the driver pod's resource usage with this command:

kubectl top pod podName

We can see that the memory increases until it reaches 1.4 GB, at which point the pod is killed.

However, when checking the driver's storage memory on the Spark UI, we can see that it is far from fully used (50.3 KB / 434 MB). Is there any difference between the storage memory of the driver and the memory of the pod containing the driver?

Has anyone had experience with a similar issue before?

Any help would be appreciated.

Here are a few more details about the app, followed by a simplified sketch of the setup:

  • Kubernetes version: 1.18
  • Spark version: 2.4.5
  • Batch interval of the Spark Streaming context: 5 seconds
  • Rate of input data: one Kafka message every 3 ms
  • Language: Scala
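
The sketch below is heavily simplified; the topic, group, broker, and class names are placeholders rather than our actual values:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    object StreamingApp {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("streaming-app")
        // 5-second batch interval, as listed above
        val ssc = new StreamingContext(conf, Seconds(5))

        // Placeholder Kafka settings
        val kafkaParams = Map[String, Object](
          "bootstrap.servers" -> "kafka:9092",
          "key.deserializer" -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id" -> "streaming-app",
          "auto.offset.reset" -> "latest",
          "enable.auto.commit" -> (false: java.lang.Boolean)
        )

        // One direct stream over a 4-partition topic, consumed by the 4 executors
        val stream = KafkaUtils.createDirectStream[String, String](
          ssc,
          LocationStrategies.PreferConsistent,
          ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams)
        )

        // The actual processing logic is omitted here
        stream.foreachRDD { rdd =>
          println(s"messages in batch: ${rdd.count()}")
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }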
asked Aug 14 '20 by Nab


1 Answer

In brief, the Spark memory consists of three parts:

  • Reserved memory (300 MB)
  • User memory ((heap - 300 MB) * 0.4), used for user data structures and data processing logic.
  • Spark memory ((heap - 300 MB) * 0.6, where 0.6 is spark.memory.fraction), used for caching and shuffle (storage + execution).

Besides this, K8s adds max(driver/executor memory * 0.1, 384 MB) of extra memory to each pod for non-JVM usage (off-heap and native allocations); 0.1 is spark.kubernetes.memoryOverheadFactor. This overhead is included in the pod's memory request and limit.
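
To see how these numbers line up with what you observed, here is a rough calculation assuming the driver runs with the default spark.driver.memory of 1g (this value is not stated in the question, and the exact heap Spark sees is usually a bit smaller, so the real figures shift slightly):

    Reserved memory : 300 MB
    Spark memory    : (1024 - 300) * 0.6 ≈ 434 MB   <- the ~434 MB pool shown as storage memory on the Spark UI
    User memory     : (1024 - 300) * 0.4 ≈ 290 MB
    Pod limit       : 1024 + max(0.1 * 1024, 384) = 1408 MB ≈ 1.4 GB   <- the point at which the pod is OOMKilled

So the storage memory on the Spark UI is only one slice of the pod's memory: the pod limit also covers user memory, the rest of the JVM, and the non-JVM overhead, which is why the pod can be killed at around 1.4 GB while storage memory looks nearly empty.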

Increasing the driver/executor memory overhead in K8s (which raises the pod's memory limit accordingly) should fix the OOM.

You can also decrease spark.memory.fraction to allocate more RAM to user memory.
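
As a sketch of what those changes could look like at submit time (the values below are illustrative, not tuned recommendations; keep the rest of your existing spark-submit arguments):

    spark-submit \
      --conf spark.driver.memory=2g \
      --conf spark.driver.memoryOverhead=1g \
      --conf spark.memory.fraction=0.5 \
      ...

When spark.driver.memoryOverhead is set explicitly, it replaces the max(driver memory * 0.1, 384 MB) default, and the K8s pod limit becomes spark.driver.memory + spark.driver.memoryOverhead.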

answered Oct 27 '22 by Hunger