I have been running a workflow on roughly 3 million records x 15 columns, all strings, on my 4-core 16GB machine using pyspark 1.5 in local mode. I have noticed that if I run the same workflow again without first restarting Spark, memory runs out and I get OutOfMemory exceptions.
Since all my caches sum up to about 1 GB, I thought the problem lies in garbage collection. I was able to run the Python garbage collector manually by calling:
import gc
collected = gc.collect()
print "Garbage collector: collected %d objects." % collected
This has helped a little.
I have played with the settings of Spark's GC according to this article, and have tried compressing the RDD and changing the serializer to Kryo. This slowed down the processing and did not help much with the memory.
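For reference, the tweaks described above (Kryo serializer, RDD compression, extra GC flags) are set through SparkConf before the context is created. This is a minimal sketch, assuming Spark 1.5-era configuration keys; the specific JVM flag shown is just an illustrative choice, not the one the article necessarily recommends:

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[4]")
        .setAppName("gc-tuning-demo")
        # switch from the default Java serializer to Kryo
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        # store serialized RDD partitions compressed
        .set("spark.rdd.compress", "true")
        # example GC tuning flag for the driver JVM (assumption, adjust to taste)
        .set("spark.driver.extraJavaOptions", "-XX:+UseConcMarkSweepGC"))

sc = SparkContext(conf=conf)
```

Note that these settings trade CPU for memory: Kryo and compression shrink the cached data but add (de)serialization work, which matches the slowdown observed above.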
Since I know exactly when I have spare cpu cycles to call the GC, it could help my situation to know how to call it manually in the JVM.
You never have to call the GC manually. If you got an OutOfMemory exception, it's because there is no more memory available. You should look for a memory leak, i.e. references you keep in your code. If you release those references, the JVM will free space when needed.
I believe this will trigger a GC (hint) in the JVM:
spark.sparkContext._jvm.System.gc()
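Putting the two halves together, a small helper can run the Python collector and then send the hint to the driver JVM through py4j. This is a sketch, assuming `spark` is an active SparkSession (with an older SparkContext `sc`, use `sc._jvm.System.gc()` instead); remember that `System.gc()` is only a request the JVM may ignore:

```python
import gc

def trigger_full_gc(spark):
    """Collect unreachable Python objects, then hint the driver JVM to GC.

    `spark` is assumed to be an active SparkSession; `System.gc()` is
    only a hint and the JVM is free to ignore it.
    """
    collected = gc.collect()             # reclaim Python-side garbage now
    spark.sparkContext._jvm.System.gc()  # py4j call into the driver JVM
    return collected
```

You could call `trigger_full_gc(spark)` at the points in the workflow where you know there are spare CPU cycles.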
See also: How to force garbage collection in Java?
and: Java: How do you really force a GC using JVMTI's ForceGargabeCollection?