I have been running a workflow on roughly 3 million records x 15 columns, all strings, on my 4-core 16GB machine using pyspark 1.5 in local mode. I have noticed that if I run the same workflow again without first restarting Spark, memory runs out and I get OutOfMemory exceptions.
Since all my caches sum up to about 1 GB, I thought the problem lies in garbage collection. I was able to run the Python garbage collector manually by calling:
import gc
collected = gc.collect()
print "Garbage collector: collected %d objects." % collected
This has helped a little.
I have played with the settings of Spark's GC according to this article, and have tried compressing the RDD and changing the serializer to Kryo. This slowed down the processing and did not help much with the memory.
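For reference, the tweaks described above (Kryo serializer, RDD compression, extra GC flags) are set through SparkConf before the context is created. This is a minimal sketch, assuming Spark 1.5-era configuration keys; the specific JVM flag shown is just an illustrative choice, not the one the article necessarily recommends:

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[4]")
        .setAppName("gc-tuning-demo")
        # switch from the default Java serializer to Kryo
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        # store serialized RDD partitions compressed
        .set("spark.rdd.compress", "true")
        # example GC tuning flag for the driver JVM (assumption, adjust to taste)
        .set("spark.driver.extraJavaOptions", "-XX:+UseConcMarkSweepGC"))

sc = SparkContext(conf=conf)
```

Note that these settings trade CPU for memory: Kryo and compression shrink the cached data but add (de)serialization work, which matches the slowdown observed above.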
Since I know exactly when I have spare cpu cycles to call the GC, it could help my situation to know how to call it manually in the JVM.
You never have to call the GC manually. If you got an OutOfMemory exception, it's because there is no more memory available. You should look for a memory leak, i.e. references you keep in your code. If you release those references, the JVM will free space when needed.
I believe this will trigger a GC (hint) in the JVM:
spark.sparkContext._jvm.System.gc()
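Putting the two halves together, a small helper can run the Python collector and then send the hint to the driver JVM through py4j. This is a sketch, assuming `spark` is an active SparkSession (with an older SparkContext `sc`, use `sc._jvm.System.gc()` instead); remember that `System.gc()` is only a request the JVM may ignore:

```python
import gc

def trigger_full_gc(spark):
    """Collect unreachable Python objects, then hint the driver JVM to GC.

    `spark` is assumed to be an active SparkSession; `System.gc()` is
    only a hint and the JVM is free to ignore it.
    """
    collected = gc.collect()             # reclaim Python-side garbage now
    spark.sparkContext._jvm.System.gc()  # py4j call into the driver JVM
    return collected
```

You could call `trigger_full_gc(spark)` at the points in the workflow where you know there are spare CPU cycles.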
See also: How to force garbage collection in Java?
and: Java: How do you really force a GC using JVMTI's ForceGargabeCollection?