I'm relatively new to Scala.
If I have a construct like this,
sampleFile.map(line => line.map { word =>
  val myObj = new MyClass(word)
  myObj.func()
})
I create an object of MyClass and do something inside a class method (func()). I repeat this for all the lines in a file (through map), so I create an object at every step of the iteration (for every line). myObj goes out of scope when the next iteration starts. Will these objects be destroyed at the end of the block, or will they be orphaned in memory? When is garbage collection triggered? Also, is it expensive to create an object at every step of the iteration? Does this have any performance implications when the number of lines grows to 1 million?
The problem with calling gc() is that it is inefficient, and in the worst case it is horribly inefficient. Let me explain. A typical GC algorithm identifies garbage by traversing all non-garbage objects in the heap and inferring that any object not visited must be garbage.
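The traversal described above can be sketched as a toy reachability walk. This is only an illustration of the idea, not how any real JVM collector is implemented; the names MarkSketch, Node, reachable and garbage are made up for this example.

```scala
import scala.collection.mutable

object MarkSketch {
  // A toy heap object: a name plus outgoing references to other objects.
  final class Node(val name: String) {
    val refs: mutable.ListBuffer[Node] = mutable.ListBuffer.empty
  }

  // Mark phase: visit every node reachable from the roots.
  def reachable(roots: Seq[Node]): Set[Node] = {
    val marked = mutable.Set.empty[Node]
    val pending = mutable.Stack.empty[Node]
    pending.pushAll(roots)
    while (pending.nonEmpty) {
      val node = pending.pop()
      // add returns true only the first time a node is seen
      if (marked.add(node)) pending.pushAll(node.refs)
    }
    marked.toSet
  }

  // Anything in the heap that was never visited is inferred to be garbage.
  def garbage(heap: Seq[Node], roots: Seq[Node]): Seq[Node] = {
    val live = reachable(roots)
    heap.filterNot(live.contains)
  }

  def main(args: Array[String]): Unit = {
    val a = new Node("a")
    val b = new Node("b")
    val c = new Node("c")
    a.refs += b // a -> b; nothing points at c
    println(garbage(Seq(a, b, c), roots = Seq(a)).map(_.name))
  }
}
```

Note that the work done is proportional to the number of live objects, which is why forcing a full traversal at an arbitrary moment can be wasteful.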
Spark runs on the Java Virtual Machine (JVM). Because Spark can store large amounts of data in memory, it relies heavily on Java's memory management and garbage collection (GC), so GC can become a major issue for many Spark applications.
The heap is created when the JVM starts up and may increase or decrease in size while the application runs. When the heap becomes full, garbage is collected. During garbage collection, objects that are no longer used are cleared, making space for new objects.
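To watch the heap grow and shrink as described, one option is to read the figures the JVM exposes through java.lang.Runtime. The sketch below is a minimal example; HeapStats and snapshot are names invented here, and the numbers you see will depend on your JVM and its settings.

```scala
object HeapStats {
  private val Mb = 1024L * 1024L

  // Returns (used, committed, max) heap sizes in megabytes.
  def snapshot(): (Long, Long, Long) = {
    val rt = Runtime.getRuntime
    val committed = rt.totalMemory()      // heap currently reserved by the JVM
    val used = committed - rt.freeMemory() // part of it occupied by objects
    (used / Mb, committed / Mb, rt.maxMemory() / Mb)
  }

  def main(args: Array[String]): Unit = {
    val (used, committed, max) = snapshot()
    println(s"used=$used MB, committed=$committed MB, max=$max MB")
  }
}
```

Calling snapshot() before and after a large batch of allocations gives a rough picture of heap usage, although a collection may run in between and lower the "used" figure.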
Your objects should all get garbage collected fairly quickly (assuming myObj.func() does not store a reference to myObj somewhere else...). On the JVM, any unreferenced object is eligible for garbage collection, and your last reference to the new object disappears as soon as myObj goes out of scope.
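The caveat about func() storing a reference can be illustrated with two hypothetical classes. Transient, Retained and registry are invented for this sketch and stand in for MyClass and "somewhere else"; the real MyClass may behave differently.

```scala
import scala.collection.mutable

object EscapeSketch {
  // Hypothetical long-lived collection standing in for "somewhere else".
  val registry: mutable.ListBuffer[AnyRef] = mutable.ListBuffer.empty

  // func() keeps no reference to this object, so each instance becomes
  // unreachable as soon as the calling block finishes.
  final class Transient(word: String) {
    def func(): Int = word.length
  }

  // func() registers the instance, so it stays reachable (and cannot be
  // collected) for as long as registry holds it.
  final class Retained(word: String) {
    def func(): Int = {
      registry += this
      word.length
    }
  }

  def main(args: Array[String]): Unit = {
    val words = Seq("a", "bb", "ccc")
    words.foreach(w => new Transient(w).func())
    println(s"retained after Transient: ${registry.size}")
    words.foreach(w => new Retained(w).func())
    println(s"retained after Retained: ${registry.size}")
  }
}
```

With a million lines, the second pattern would keep a million objects alive, while the first leaves nothing behind for the collector to preserve.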
Garbage collection of short-lived objects is generally very cheap and efficient, so you probably shouldn't worry about it (at least until you have benchmarks or measured performance problems that prove otherwise).
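If you want a first rough measurement before worrying, something like the following could be tried. It is only a sketch: MyClass here is a hypothetical stand-in for the class in the question, and a single cold run like this is not a reliable benchmark (the JIT may even optimise the allocation away), so a proper harness such as JMH would be needed for numbers you can trust.

```scala
object AllocationSketch {
  // Hypothetical stand-in for MyClass from the question.
  final class MyClass(word: String) {
    def func(): Int = word.length
  }

  // Allocates one short-lived object per word.
  // Returns (sum of func() results, elapsed milliseconds).
  def run(words: Iterator[String]): (Long, Long) = {
    val start = System.nanoTime()
    var total = 0L
    words.foreach { w => total += new MyClass(w).func() }
    (total, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    // One million inputs, matching the size mentioned in the question.
    val (total, millis) = run(Iterator.fill(1000000)("word"))
    println(s"total=$total, elapsed=$millis ms")
  }
}
```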
In particular, since you appear to be doing IO (reading from a sample file?), I expect the overhead of GC to be negligible compared to the cost of your disk IO operations.
Garbage collection is the responsibility of the JVM, not Scala. So the precise details depend on which JVM you're running. There is no defined time at which garbage collection is triggered; the JVM tries to do it when it is opportune or necessary.
Someone more knowledgeable than me on the subject of GC algorithms and JVM tuning could probably give you some concrete explanation to address your performance concerns, but in general I'd say you should just trust that JVMs are pretty good at garbage collecting "intelligently".
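Since the details depend on the JVM, it can help to ask the running JVM which collectors it is using and how often they have run. The sketch below uses the standard java.lang.management API; GcInfo and collectors are names made up here, and the scala.jdk.CollectionConverters import assumes Scala 2.13 or later.

```scala
import java.lang.management.ManagementFactory
import scala.jdk.CollectionConverters._ // assumes Scala 2.13+

object GcInfo {
  // One entry per collector: (name, collections so far, accumulated ms).
  def collectors(): Seq[(String, Long, Long)] =
    ManagementFactory.getGarbageCollectorMXBeans.asScala.toSeq.map { gc =>
      (gc.getName, gc.getCollectionCount, gc.getCollectionTime)
    }

  def main(args: Array[String]): Unit =
    collectors().foreach { case (name, count, millis) =>
      println(s"$name: $count collections, $millis ms total")
    }
}
```

Printing this at the end of a run over the file shows how many collections actually happened and roughly how much time they took, which is usually more informative than guessing.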