I am beginner to Spark and I am running my application to read 14KB data from text filed, do some transformations and actions(collect, collectAsMap) and save data to Database
I am running it locally in my macbook with 16G memory, with 8 logical cores.
Java Max heap is set at 12G.
Here is the command I use to run the application.
bin/spark-submit --class com.myapp.application --master local[*] --executor-memory 2G --driver-memory 4G /jars/application.jar
I am getting the following warning
2017-01-13 16:57:31.579 [Executor task launch worker-8hread] WARN org.apache.spark.storage.MemoryStore - Not enough space to cache rdd_57_0 in memory! (computed 26.4 MB so far)
Can anyone guide me on what is going wrong here and how can I improve performance? Also how to optimize on suffle-spill ? Here is a view of the spill that happens in my local system
1 Answer. Executors are worker nodes' processes in charge of running individual tasks in a given Spark job and The spark driver is the program that declares the transformations and actions on RDDs of data and submits such requests to the master.
Determine the memory resources available for the Spark application. Multiply the cluster RAM size by the YARN utilization percentage. Provides 5 GB RAM for available drivers and 50 GB RAM available for worker nodes. Discount 1 core per worker node to determine the executor core instances.
An executor is a process that is launched for a Spark application on a worker node. Each executor memory is the sum of yarn overhead memory and JVM Heap memory. JVM Heap memory comprises of: RDD Cache Memory. Shuffle Memory.
Running executors with too much memory often results in excessive garbage collection delays. So it is not a good idea to assign more memory. Since you have only 14KB data 2GB executors memory and 4GB driver memory is more than enough. There is no use of assigning this much memory. You can run this job with even 100MB memory and performance will be better then 2GB.
Driver memory are more useful when you run the application, In yarn-cluster mode, because the application master runs the driver. Here you are running your application in local mode driver-memory
is not necessary. You can remove this configuration from you job.
In your application you have assigned
Java Max heap is set at: 12G.
executor-memory: 2G
driver-memory: 4G
Total memory allotment= 16GB and your macbook having 16GB only memory. Here you have allocated total of your RAM memory to your spark application.
This is not good. Operating system itself consume approx 1GB memory and you might have running other applications which also consume the RAM memory. So here you are actually allocating more memory then you have. And this is the root cause that your application is throwing error Not enough space to cache the RDD
executor-memory 1G
or lessdriver-memory
from your configuration.Submit your job. It will run smoothly.
If you are very keen to know spark memory management techniques, refer this useful article.
Spark on yarn executor resource allocation
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With