Manually calling spark's garbage collection from pyspark

I have been running a workflow on roughly 3 million records × 15 columns (all strings) on my 4-core, 16 GB machine using PySpark 1.5 in local mode. I have noticed that if I run the same workflow again without first restarting Spark, memory runs out and I get Out of Memory exceptions.

Since all my caches sum up to about 1 GB, I thought the problem lay in garbage collection. I was able to run the Python garbage collector manually by calling:

import gc
collected = gc.collect()
print "Garbage collector: collected %d objects." % collected

This has helped a little.

I have played with Spark's GC settings according to this article, and have tried compressing the RDD and changing the serializer to Kryo. This slowed down the processing and did not help much with the memory.
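
For reference, this is roughly the kind of configuration I experimented with (a sketch; the JVM GC flags themselves have to be passed at JVM startup, e.g. via spark-submit, so only the compression and serializer settings are shown here, and the app name is just a placeholder):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[4]")
        .setAppName("my-workflow")  # placeholder name
        # compress serialized RDD partitions kept in memory
        .set("spark.rdd.compress", "true")
        # use Kryo instead of the default Java serializer
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))

sc = SparkContext(conf=conf)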

Since I know exactly when I have spare CPU cycles to call the GC, it would help my situation to know how to call it manually in the JVM.

asked Nov 13 '15 by architectonic

2 Answers

You should never have to call the GC manually. If you get an OutOfMemory exception, it is because there is no more memory available. You should look for a memory leak, i.e. references you keep in your code. If you release those references, the JVM will free the space when it needs it.
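
For example (a sketch, assuming the cached data is held in an RDD referenced by a Python variable; the file path and variable name are hypothetical), explicitly unpersisting and dropping the reference lets the JVM reclaim the cached blocks on its own:

# hypothetical cached RDD from the workflow
rdd = sc.textFile("data.csv").cache()
rdd.count()        # materializes the cache

# once the data is no longer needed, release the cached blocks
rdd.unpersist()
rdd = None         # drop the Python reference as well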

answered Sep 20 '22 by crak

I believe this will trigger a GC (hint) in the JVM:

spark.sparkContext._jvm.System.gc()
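
A small usage sketch (the wrapper name is my own, not part of Spark's API); note that the Py4J _jvm gateway reaches only the driver JVM, so in local mode this covers the whole application, but on a cluster it would not touch the executor JVMs:

import gc

def force_gc(spark):
    """Run Python GC, then hint the driver JVM to collect."""
    collected = gc.collect()             # collect Python-side cycles
    spark.sparkContext._jvm.System.gc()  # GC hint to the driver JVM via Py4J
    return collected

# call it at a point where spare CPU cycles are available, e.g. between stages
force_gc(spark)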

See also: How to force garbage collection in Java?

and: Java: How do you really force a GC using JVMTI's ForceGargabeCollection?

answered Sep 22 '22 by Paul Brannan