Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

refactoring Java arrays and primitives (double[][]) to Collections and Generics (List<List<Double>>)

I have been refactoring throwaway code which I wrote some years ago in a FORTRAN-like style. Most of the code is now much more organized and readable. However the heart of the algorithm (which is performance-critical) uses 1- and 2-dimensional Java arrays and is typified by:

    for (int j = 1; j < len[1]+1; j++) {
        int jj = (cont == BY_TYPE) ? seq[1][j-1] : j-1;
        for (int i = 1; i < len[0]+1; i++) {
            matrix[i][j] = matrix[i-1][j] + gap;
            double m = matrix[i][j-1] + gap;
            if (m > matrix[i][j]) {
                matrix[i][j] = m;
                pointers[i][j] = UP;
            }
            //...
        }
    }

For clarity, maintainability and interfacing with the rest of the code I would like to refactor it. However on reading Java Generics Syntax for arrays and Java Generics and numbers I have the following concerns:

  • Performance. The code is planned to use about 10^8 - 10^9 secs/yr and this is just about manageable. My reading suggests that changing double to Double can sometimes add a factor of 3 in performance. I'd like other experience on this. I would also expect that moving from foo[] to List would be a hit as well. I have no first-hand knowledge and again experience would be useful.

  • Array-bound checking. Is this treated differently in double[] and List and does it matter? I expect some problems to violate bounds as the algorithm is fairly simple and has only been applied to a few data sets.

  • If I don't refactor then the code has an ugly and possibly fragile intermixture of the two approaches. I am already trying to write things such as:

    List<double[]> and List<Double>[]

and understand that the erasure does not make this pretty and at best gives rise to compiler warnings. It seems difficult to do this without very convoluted constructs.

  • Obsolescence. One poster suggested that Java arrays should be obsoleted. I assume this isn't going to happen RSN but I would like to move away from outdated approaches.

SUMMARY The consensus so far:

  • Collections have a significant performance hit over primitive arrays, especially for constructs such as matrices. This is incurred in auto(un)boxing numerics and in accessing list items

  • For tight numerical (scientific) algorithms the array notation [][] is actually easier to read but the variables should named as helpfully as possible

  • Generics and arrays do not mix well. It may be useful to wrap the arrays in classes to transport them in/out of the tight algorithm.

There is little objective reason to make the change

QUESTION @SeanOwen has suggested that it would be useful to take constant values out of the loops. Assuming I haven't goofed this would look like:

 int len1 = len[1];
 int len0 = len[0];
 int seq1 = seq[1];
 int[] pointersi;
 double[] matrixi;
 for (int i = 1; i < len0+1; i++) {
     matrixi = matrix[i];
     pointersi = pointers[i];
 }
 for (int j = 1; j < len1+1; j++) {
    int jj = (cont == BY_TYPE) ? seq1[j-1] : j-1;
    for (int i = 1; i < len0+1; i++) {
        matrixi[j] = matrixi[j] + gap;
        double m = matrixi[j-1] + gap;
        if (m > matrixi[j]) {
            matrixi[j] = m;
            pointersi[j] = UP;
        }
        //...
    }
}

I thought compilers were meant to be smart at doing this sort of thing. Do we need to still do this?

like image 649
peter.murray.rust Avatar asked Sep 11 '09 07:09

peter.murray.rust


4 Answers

I fully agree with KLE's answer. Because the code is performance-critical, I'd keep the array based datastructures as well. And I believe, that just introducing collections, wrappers for primitive types and generics will not improve maintainability and clarity.

In addition, if this algorithm is the heart of the application and has been in use for several years now, chance are fairly low, that it will need maintenance like bug fixing or improvements.

For clarity, maintainability and interfacing with the rest of the code I would like to refactor it.

Instead of changing datastructures I'd concentrate on renaming and maybe moving some part of the code to private methods. From looking at the code, I have no idea what's happening, and the problem, as I see it, are the more or less short and technical variable and field names.

Just an example: one 2-dimensional array is just named 'matrix'. But it's obviously clear, that this is a matrix, so naming it 'matrix' is pretty redundant. It would be more helpful to rename it so that it becomes clear, what this matrix is really used for, what kind of data is inside.

Another candidate is your second line. With two refactorings, I'd rename 'jj' to something more meaningful and move the expression to a private method with a 'speaking' name.

like image 27
Andreas Dolk Avatar answered Oct 05 '22 10:10

Andreas Dolk


The general guideline is to prefer generified collections over arrays in Java, but it's only a guideline. My first thought would be to NOT change this working code. If you really want to make this change, then benchmark both approaches.

As you say, performance is critical, in which case the code that meets the needed performance is better than code that doesn't.

You might also run into auto-boxing issues when boxing/unboxing the doubles - a potentially more subtle problem.

The Java language guys have been very strict about keeping the JVM compatible across different versions so I don't see arrays going anywhere - and I wouldn't call them obsolete, just more primitive than the other options.

like image 34
SteveD Avatar answered Oct 05 '22 12:10

SteveD


Well I think that arrays are the best way to store process data in algorithms. Since Java doesn't support operator overloading (one of the reasons why I think arrays won't be obsolete that soon) switching to collections would make the code quite hard to read:

double[][] matrix = new double[10][10];
double t = matrix[0][0];

List<List<Double>> matrix = new ArrayList<List<Double>>(10);
Collections.fill(matrix, new ArrayList<Double>(10));
double t = matrix.get(0).get(0); // autoboxing => performance

As far as I know Java prestores some wrapper Object for Number instances (e.g. the first 100 integers), so that you can access them faster but I think that won't help much with that many data.

like image 40
Daff Avatar answered Oct 05 '22 12:10

Daff


I read an excellent book by Kent Beck on coding best-practices ( http://www.amazon.com/Implementation-Patterns/dp/B000XPRRVM ). There are also interesting performance figures. Specifically, there are comparison between arrays and various collections., and arrays are really much faster (maybe x3 compared to ArrayList).

Also, if you use Double instead of double, you need to stick to it, and use no double, as auto(un)boxing will kill your performance.

Considering your performance need, I would stick to array of primitive type.


Even more, I would calculate only once the upper bound for the condition in loops. This is typically done the line before the loop.

However, if you don't like that the upper bound variable, used only in the loop, is accessible outside the loop, you can take advantage of the initialization phase of the for loop like this:

    for (int i=0, max=list.size(); i<max; i++) {
      // do something
    }

I don't believe in obsolescence for arrays in java. For performance-critical loop, I can't see any language designer taking away the fastest option (especially if the difference is x3).


I understand your concern for maintainability, and for coherence with the rest of the application. But I believe that a critical loop is entitled to some special practices.

I would try to make the code the clearest possible without changing it:

  • by carefully questionning each variable name, ideally with a 10-min brainstorming session with my collegues
  • by writing coding comments (I'm against their use in general, as a code that is not clear should be made clear, not commented ; but a critical loop justifies it).
  • by using private methods as needed (as Andreas_D pointed out in his answer). If made private final, chances are very good (as they would be short) that they will get inlined when running, so there would be no performance impact at runtime.
like image 166
KLE Avatar answered Oct 05 '22 10:10

KLE