Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

array of structures, or structure of arrays?

Hmmm. I have a table which is an array of structures I need to store in Java. The naive don't-worry-about-memory approach says do this:

public class Record {
  final private int field1;
  final private int field2;
  final private long field3;
  /* constructor & accessors here */
}

List<Record> records = new ArrayList<Record>();

If I end up using a large number (> 106 ) of records, where individual records are accessed occasionally, one at a time, how would I figure out how the preceding approach (an ArrayList) would compare with an optimized approach for storage costs:

public class OptimizedRecordStore {
  final private int[] field1;
  final private int[] field2;
  final private long[] field3;

  Record getRecord(int i) { return new Record(field1[i],field2[i],field3[i]); }
  /* constructor and other accessors & methods */
}

edit:

  • assume the # of records is something that is changed infrequently or never
  • I'm probably not going to use the OptimizedRecordStore approach, but I want to understand the storage cost issue so I can make that decision with confidence.
  • obviously if I add/change the # of records in the OptimizedRecordStore approach above, I either have to replace the whole object with a new one, or remove the "final" keyword.
  • kd304 brings up a good point that was in the back of my mind. In other situations similar to this, I need column access on the records, e.g. if field1 and field2 are "time" and "position", and it's important for me to get those values as an array for use with MATLAB, so I can graph/analyze them efficiently.
like image 220
Jason S Avatar asked Jul 14 '09 14:07

Jason S


People also ask

What is structure and array of structure?

STRUCTURE. Array refers to a collection consisting of elements of homogeneous data type. Structure refers to a collection consisting of elements of heterogeneous data type. Array uses subscripts or “[ ]” (square bracket) for element access.

What is the meaning of array of structures?

An array of structure in C programming is a collection of different datatype variables, grouped together under a single name.

What is array of structure example?

Declaration of Array of Structure in C The struct keyword is used to create a structure. Here is an example: struct student { char name[50]; char class[100]; int roll_number; float marks[5]; };


3 Answers

The answers that give the general "optimise when you have to" is unhelpful in this case because , IMHO, programmers should always be aware of the performance in different in design choices when that choice leads to an order of magnitude performance penalty, particularly API writers.

The original question is quite valid and I would tend to agree that the second approach is better, given his particular situation. I've written image processing code where each pixel requires a data structure, a situation not too dissimilar to this, except I needed frequent random access to each pixel. The overhead of creating one object for each pixel was enormous.

like image 184
flux Avatar answered Oct 11 '22 18:10

flux


The second version is much, much worse. Instead of resizing one array, you're resizing three arrays when you do an insert or delete. What's more, the second version will lead to the creation of many more temporary objects and it will do so on accesses. That could lead to a lot of garbage (from a GC point of view). Not good.

Generally speaking, you should worry about how you use the objects long before you think about performance. So you have a record with three fields or three arrays. Which one more accurately depicts what you're modeling? By this I mean, when you insert or delete an item, are you doing one of the three arrays or all three as a block?

I suspect it's the latter in which case the former makes far more sense.

If you're really concerned about insertion/deletion performance then perhaps a different data structure is appropriate, perhaps a SortedSet or a Map or SortedMap.

like image 34
cletus Avatar answered Oct 11 '22 19:10

cletus


If you have millions of records, the second approach has several advantages:

  • Memory usage: the first approach uses more memory because a) every Java object in heap has a header (containing class id, lock state etc.); b) objects are aligned in memory; c) each reference to an object costs 4 bytes (on 64-bit JVMs with Compressed OOPs or 32-bit JVMs) or 8 bytes (64-bit JVMs without Compressed OOPs). See e. g. CompressedOops for more details. So the first approach takes about two times more memory (more precisely: according to my benchmark, an object with 16 bytes of payload + a reference to it took 28 bytes on 32-bit Java 7, 36 bytes on 64-bit Java 7 with compressed OOPs, and 40 bytes on 64-bit Java 7 w/o compressed OOPs).
  • Garbage collection: although the second approach seems to create many objects (one on each call of getRecord), it might not be so, as modern server JVMs (e. g. Oracle's Java 7) can apply escape analysis and stack allocation to avoid heap allocation of temporary objects in some cases; anyway, GCing short-lived objects is cheap. On the other hand, it is probably easier for the garbage collector if there are not millions of long-lived objects (as there are in the first approach) whose reachability to check (or at least, such objects may make your application need more careful tuning of GC generation sizes). Thus the second approach may be better for GC performance. However, to see whether it makes a difference in the real situation, one should make a benchmark oneself.
  • Serialization speed: the speed of (de)serializing a large array of primitives on disk is only limited by HDD speed; serializing many small objects is inevitably slower (especially if you use Java's default serialization).

Therefore I have used the second approach quite often for very large collections. But of course, if you have enough memory and don't care about serialization, the first approach is simpler.

like image 5
Jaan Avatar answered Oct 11 '22 18:10

Jaan