
Can Java recognize the SIMD capabilities of the CPU, or is this just the optimization effect of loop unrolling?

This code is from the dotproduct method of a vector class of mine. The method computes the inner product against a target array of 1000 vectors.

When the vector length (N) is an odd number (262145), the compute time is 4.37 seconds. When it is 262144 (a multiple of 8), the compute time is 1.93 seconds.

     time1=System.nanoTime();
     int count=0;
     for(int j=0;j<1000;i++)
     {

             b=vektors[i]; // selects next vector(b) to multiply as inner product.
                           // each vector has an array of float elements.

             if(((N/2)*2)!=N)
             {
                 for(int i=0;i<N;i++)
                 {
                     t1+=elements[i]*b.elements[i];
                 }
             }
             else if(((N/8)*8)==N)
             {
                 float []vek=new float[8];
                 for(int i=0;i<(N/8);i++)
                 {
                     vek[0]=elements[i]*b.elements[i];
                     vek[1]=elements[i+1]*b.elements[i+1];
                     vek[2]=elements[i+2]*b.elements[i+2];
                     vek[3]=elements[i+3]*b.elements[i+3];
                     vek[4]=elements[i+4]*b.elements[i+4];
                     vek[5]=elements[i+5]*b.elements[i+5];
                     vek[6]=elements[i+6]*b.elements[i+6];
                     vek[7]=elements[i+7]*b.elements[i+7];


                     t1+=vek[0]+vek[1]+vek[2]+vek[3]+vek[4]+vek[5]+vek[6]+vek[7];
                     //t1 is total sum of all dot products.
                 }
             }
     }
     time2=System.nanoTime();
     time3=(time2-time1)/1000000000.0; //seconds

Question: could the reduction in time from 4.37 s to 1.93 s (about 2x faster) be the JIT wisely deciding to use SIMD instructions, or is it just the positive effect of my loop unrolling?

If the JIT cannot do SIMD optimization automatically, does that also mean no unrolling optimization was done automatically by the JIT in this example?

For 1M iterations (vectors) and a vector size of 64, the speedup multiplier rises to 3.5x (a cache advantage?).

Thanks.

asked Dec 11 '22 by huseyin tugrul buyukisik

2 Answers

Your code has a bunch of problems. Are you sure you're measuring what you think you're measuring?

Your first loop does this, indented more conventionally:

 for(int j=0;j<1000;i++) {
     b=vektors[i]; // selects next vector(b) to multiply as inner product.
                   // each vector has an array of float elements.
 }

Your rolled loop involves a really long chain of dependent loads and stores. Your unrolled loop involves 8 separate chains of dependent loads and stores. The JVM can't turn one into the other when you're using floating-point arithmetic, because floating-point addition is not associative: reordering the additions is a fundamentally different computation. Breaking dependent load-store chains can lead to major speedups on modern processors.

Your rolled loop iterates over the whole vector. Your unrolled loop only iterates over the first (roughly) eighth. Thus, the unrolled loop again computes something fundamentally different.
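
To make the comparison fair, the unrolled version would have to cover the whole vector and keep separate accumulators. Here is a minimal sketch of what that might look like (it reuses N, elements, b, and t1 from the question; note that the separate accumulators sum in a different order than the scalar loop, so the result can differ in the last bits):

    // Assumes N is a multiple of 8; strides by 8 so the whole array is covered.
    // Eight independent accumulators break the single dependency chain through t1.
    float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
    float s4 = 0f, s5 = 0f, s6 = 0f, s7 = 0f;
    for (int i = 0; i < N; i += 8) {
        s0 += elements[i]     * b.elements[i];
        s1 += elements[i + 1] * b.elements[i + 1];
        s2 += elements[i + 2] * b.elements[i + 2];
        s3 += elements[i + 3] * b.elements[i + 3];
        s4 += elements[i + 4] * b.elements[i + 4];
        s5 += elements[i + 5] * b.elements[i + 5];
        s6 += elements[i + 6] * b.elements[i + 6];
        s7 += elements[i + 7] * b.elements[i + 7];
    }
    t1 += ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));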

I haven't seen a JVM generate vectorised code for something like your second loop, but I'm maybe a few years out of date on what JVMs do. Try using -XX:+PrintAssembly when you run your code and inspect the code that opto (HotSpot's C2 server compiler) generates.
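
For reference, -XX:+PrintAssembly is a diagnostic flag, so it has to be unlocked first, and it needs the hsdis disassembler library installed for your platform. The invocation looks roughly like this (the class name is a placeholder for your benchmark):

    java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly YourBenchmark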

answered Dec 14 '22 by tmyklebu


I have done a little research on this (and I'm drawing on knowledge from a similar project I did in C with matrix multiplication), but take my answer with a grain of salt, as I am by no means an expert on this topic.

As for your first question, I think the speedup comes from your loop unrolling: you're making roughly 87% fewer loop-condition checks. As far as I know, the JVM has supported SSE since 1.4, but to actually control whether your code uses vectorization (and to know for sure), you'll need to use JNI.

See an example of JNI here: Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?

When you decrease your vector size from 262144 to 64, cache is definitely a factor. When I did this project in C, we had to implement cache blocking for larger matrices in order to take advantage of the cache (see the sketch below). One thing you might want to do is check your cache size.
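
For illustration, cache blocking amounts to iterating over tiles small enough that the data being reused stays resident in cache. A rough sketch in Java (not the actual project code; a, b, c, n, and the BLOCK size are hypothetical here) for C = A * B with row-major n x n matrices:

    // BLOCK is tuned so the tiles being reused fit in cache together.
    final int BLOCK = 64; // hypothetical tile size
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int kk = 0; kk < n; kk += BLOCK)
            for (int jj = 0; jj < n; jj += BLOCK)
                // Multiply one BLOCK x BLOCK tile pair before moving on,
                // so a[i][k] and the b tile are reused while still cached.
                for (int i = ii; i < Math.min(ii + BLOCK, n); i++)
                    for (int k = kk; k < Math.min(kk + BLOCK, n); k++) {
                        float aik = a[i * n + k];
                        for (int j = jj; j < Math.min(jj + BLOCK, n); j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }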

Just as a side note: it might be better to measure performance in flops rather than seconds, because the runtime (in seconds) of your program can vary with many factors, such as CPU load at the time.
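
Concretely, a length-N dot product costs about 2N floating-point operations (N multiplies plus roughly N adds), so for the question's workload of 1000 vectors you could report something like:

    // ~2*N flops per dot product, times 1000 vectors, divided by elapsed seconds.
    double gflops = 2.0 * N * 1000 / time3 / 1e9;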

answered Dec 14 '22 by Arjun Baokar