
Same program, same JVM, but vastly different memory requirements and execution time on different machines - why?

I'm trying to run a NetLogo (a Java simulation framework) simulation on a cluster as part of a large experiment, and I was surprised by the seemingly massive memory requirement of a (relatively) simple simulation. On the cluster it throws "java.lang.OutOfMemoryError: Java heap space" for any heap size below "-Xmx2500M", and a single execution takes 5 hours. I ran the same experiment on both my Macs (iMac and MacBook Pro): they finished in less than one hour, with "-Xmx1024M" giving no errors. The cluster jobs also require "-XX:MaxPermSize=250M", whereas on my Macs the default is sufficient. I ran the same code with the same inputs, using the exact same jars in all cases.

64-bit JVMs are used in each case (and as far as I know they are pretty similar):

<on the cluster>
$ java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)

<on my macs>
$ java -version
java version "1.6.0_31"
Java(TM) SE Runtime Environment (build 1.6.0_31-b04-415-10M3646)
Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01-415, mixed mode)

I am running the Client JVM in all cases (I was initially using Server on the cluster; switching to Client made no difference). I have also tried executing on the cluster with Java 7, with the same huge memory and execution time issues.
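One way to check whether the JVMs really are configured alike is to have each machine report what it is actually running with. The following is only a minimal sketch: the class name `JvmInfo` is made up for illustration, and it uses nothing beyond the standard `java.lang.management` API. Run with the same flags as the simulation, it prints the VM name, the effective maximum heap, the CPU count, the JVM arguments, and the limit of each memory pool.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of NetLogo: prints the settings the JVM actually uses.
public class JvmInfo {
    private static final long MB = 1024L * 1024L;

    /** Builds a short report of the settings the running JVM actually uses. */
    static String report() {
        StringBuilder sb = new StringBuilder();
        sb.append("vm name: ").append(System.getProperty("java.vm.name")).append('\n');
        sb.append("vm version: ").append(System.getProperty("java.vm.version")).append('\n');
        sb.append("os arch: ").append(System.getProperty("os.arch")).append('\n');
        sb.append("cpus: ").append(Runtime.getRuntime().availableProcessors()).append('\n');
        sb.append("max heap (MB): ").append(Runtime.getRuntime().maxMemory() / MB).append('\n');
        sb.append("jvm args: ")
          .append(ManagementFactory.getRuntimeMXBean().getInputArguments()).append('\n');
        // Per-pool limits; on Java 6/7 this list includes the permanent generation.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // pool is no longer valid
            }
            long max = usage.getMax(); // -1 means the limit is undefined
            sb.append("pool ").append(pool.getName()).append(" max (MB): ")
              .append(max < 0 ? "undefined" : String.valueOf(max / MB)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```

Comparing the output from the cluster and from a Mac side by side should show whether the VM type, heap limit, or pool sizes differ in practice.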

I am completely perplexed, and no one I have spoken to can explain this. Has anyone out there come across this before? Any help is greatly appreciated!

user1660640 asked Sep 10 '12 16:09



1 Answer

I suspect one machine has faster network or disk IO than the other. If you use queues to write to disk or to the network, and one computer can keep up but the other cannot, the queue on the slower machine can grow without bound, slowing the machine down and consuming an unlimited amount of memory.

Faster network IO can either help you send data faster (keeping queues small), or it can mean you receive data too fast (so queues grow faster than they are consumed).
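To illustrate the queue effect, here is a small sketch of a fast producer feeding a slow consumer. It is not taken from NetLogo or from the asker's code: the class `BoundedQueueDemo`, the 1 ms sleep standing in for slow IO, and the capacity of 10 are all assumptions chosen for the demonstration. With an unbounded `LinkedBlockingQueue` the backlog can grow as large as the producer likes. With a bounded one, `put` blocks when the queue is full, so memory use stays capped and the producer is slowed down instead.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical demo: a fast producer and a slow consumer sharing one queue.
public class BoundedQueueDemo {

    /** Pushes `items` elements through the queue; returns the largest size the producer saw. */
    static int run(final BlockingQueue<Integer> queue, final int items) {
        Thread consumer = new Thread(new Runnable() {
            public void run() {
                try {
                    for (int i = 0; i < items; i++) {
                        queue.take();
                        Thread.sleep(1); // stand-in for slow disk or network IO
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        consumer.start();

        int maxSeen = 0;
        try {
            for (int i = 0; i < items; i++) {
                queue.put(i); // blocks while a bounded queue is full
                maxSeen = Math.max(maxSeen, queue.size());
            }
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while producing", e);
        }
        return maxSeen;
    }

    public static void main(String[] args) {
        System.out.println("unbounded, max backlog: "
                + run(new LinkedBlockingQueue<Integer>(), 200));
        System.out.println("bounded(10), max backlog: "
                + run(new LinkedBlockingQueue<Integer>(10), 200));
    }
}
```

The exact backlog in the unbounded case depends on timing, but in the bounded case it can never exceed the capacity. If the simulation has a similar hand-off between computing results and writing them out, slower IO on the cluster could explain both the extra memory and the longer run time.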

A lot depends on what your application actually does. When your program gets an OOME (OutOfMemoryError), I suggest you take a heap dump, analyse it, and look for collections (e.g. queues) that are consuming a lot of memory.
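The simplest way to get that dump on a HotSpot JVM is to start the job with `-XX:+HeapDumpOnOutOfMemoryError` (optionally with `-XX:HeapDumpPath=` pointing at a writable location), which writes an `.hprof` file at the moment the error is thrown. If you would rather take a snapshot at a point of your own choosing, the sketch below asks the JVM for one programmatically. It assumes a HotSpot JVM, since `HotSpotDiagnosticMXBean` is HotSpot-specific, and the class name `HeapDumper` and the default file name are made up for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

// Hypothetical helper; relies on the HotSpot-specific diagnostic MXBean.
public class HeapDumper {

    /** Writes a heap dump of the current JVM to the given path (the file must not exist yet). */
    static void dump(String path) {
        try {
            HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            bean.dumpHeap(path, true); // true = dump only live (reachable) objects
        } catch (IOException e) {
            throw new IllegalStateException("could not write heap dump to " + path, e);
        }
    }

    public static void main(String[] args) {
        // The default file name is an assumption; pass a path as the first argument instead.
        String path = args.length > 0 ? args[0] : "simulation-heap.hprof";
        dump(path);
        System.out.println("heap dump written to " + path);
    }
}
```

Either way, the resulting `.hprof` file can be opened in a heap analyser to see which objects dominate the heap. Taking one dump on the cluster and one on a Mac at a comparable point in the run makes the difference easier to spot.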

Peter Lawrey answered Oct 20 '22 09:10
