In managed code, how do I achieve good locality of reference?

Question

Since RAM seems to be the new disk, and since that statement also means that access to memory is now considered slow similarly to how disk access has always been, I do want to maximize locality of reference in memory for high performance applications. For example, in a sorted index, I want adjacent values to be close (unlike say, in a hashtable), and I want the data the index is pointing to close by, too.

In C, I can whip up a data structure with a specialized memory manager, like the developers of the (immensely complex) Judy array did. With direct control over the pointers, they even went so far as to encode additional information in the pointer value itself. When working in Python, Java or C#, I am deliberately one (or more) level(s) of abstraction away from this type of solution and I'm entrusting the JIT compilers and optimizing runtimes with doing clever tricks on the low levels for me.

Still, I guess, even at this high level of abstraction, there are things that can be semantically considered "closer" and therefore are likely to be actually closer at the low levels. For example, I was wondering about the following (my guess in parentheses):

Can I expect an array to be an adjacent block of memory (yes)?
Are two integers in the same instance closer than two in different instances of the same class (probably)?
Does an object occupy a contigous region in memory (no)?
What's the difference between an array of objects with only two int fields and a single object with two int[] fields? (this example is probably Java specific)

I started wondering about these in a Java context, but my wondering has become more general, so I'd suggest to not treat this as a Java question.

Jon Skeet · Accepted Answer

In .NET, elements of an array are certainly contiguous. In Java I'd expect them to be in most implementations, but it appears not to be guaranteed.
I think it's reasonable to assume that the memory used by an instance for fields is in a single block... but don't forget that some of those fields may be references to other objects.

For the Java array part, Sun's JNI documentation includes this comment, tucked away in a discussion about strings:

For example, the Java virtual machine may not store arrays contiguously.

For your last question, if you have two int[] then each of those arrays will be a contiguous block of memory, but they could be very "far apart" in memory. If you have an array of objects with two int fields, then each object could be a long way from each other, but the two integers within each object will be close together. Potentially more importantly, you'll end up taking a lot more memory with the "lots of objects" solution due to the per-object overhead. In .NET you could use a custom struct with two integers instead, and have an array of those - that would keep all the data in one big block.

I believe that in both Java and .NET, if you allocate a lot of smallish objects in quick succession within a single thread then those objects are likely to have good locality of reference. When the GC compacts a heap, this may improve - or it may potentially become worse, if a heap with

A B C D E

is compacted to

A D E B

(where C is collected) - suddenly A and B, which may have been "close" before, are far apart. I don't know whether this actually happens in any garbage collector (there are loads around!) but it's possible.

Basically in a managed environment you don't usually have as much control over locality of reference as you do in an unmanaged environment - you have to trust that the managed environment is sufficiently good at managing it, and that you'll have saved enough time by coding to a higher level platform to let you spend time optimising elsewhere.

JesperE · Answer

First, your title is implying C#. "Managed code" is a term coined by Microsoft, if I'm not mistaken.

Java primitive arrays are guaranteed to be a continuous block of memory. If you have a

int[] array = new int[4];

you can from JNI (native C) get a int *p to point to the actual array. I think this goes for the Array* class of containers as well (ArrayList, ArrayBlockingQueue, etc).

Early implementations of the JVM had objects as contiuous struct, I think, but this cannot be assumed with newer JVMs. (JNI abstracts away this).

Two integers in the same object will as you say probably be "closer", but they may not be. This will probably vary even using the same JVM.

An object with two int fields is an object and I don't think any JVM makes any guarantee that the members will be "close". An int-array with two elements will very likely be backed by a 8 byte long array.

Dzmitry Huba · Answer

With regards to arrays here is an excerpt from CLI (Common Language Infrastructure) specification:

Array elements shall be laid out within the array object in row-major order (i.e., the elements associated with the rightmost array dimension shall be laid out contiguously from lowest to highest index). The actual storage allocated for each array element can include platform-specific padding. (The size of this storage, in bytes, is returned by the sizeof instruction when it is applied to the type of that array’s elements.

Kylotan · Answer

Good question! I think I would resort to writing extensions in C++ that handle memory in a more carefully managed way and just exposing enough of an interface to allow the rest of the application to manipulate the objects. If I was that concerned about performance I would probably resort to a C++ extension anyway.

In managed code, how do I achieve good locality of reference?

Tags:

java

python

memory-management

c#

optimization

Hanno Fietz

4 Answers

Jon Skeet

JesperE

Dzmitry Huba

Kylotan

Recent Activity

Donate For Us

In managed code, how do I achieve good locality of reference?

Tags:

java

python

memory-management

c#

optimization

Hanno Fietz

4 Answers

Jon Skeet

JesperE

Dzmitry Huba

Kylotan

Related questions

Recent Activity

Donate For Us