I've been trying to optimize some extremely performance-critical code (a quicksort that's called millions and millions of times inside a Monte Carlo simulation) by loop unrolling. Here's the inner loop I'm trying to speed up:
// Search for elements to swap.
while(myArray[++index1] < pivot) {}
while(pivot < myArray[--index2]) {}
I tried unrolling to something like:
while(true) {
    if(myArray[++index1] >= pivot) break;
    if(myArray[++index1] >= pivot) break;
    // More unrolling
}
while(true) {
    if(pivot >= myArray[--index2]) break;
    if(pivot >= myArray[--index2]) break;
    // More unrolling
}
This made absolutely no difference, so I changed it back to the more readable form. I've had similar experiences the other times I've tried loop unrolling. Given the quality of branch predictors on modern hardware, when, if ever, is loop unrolling still a useful optimization?
The main drawbacks of unrolling are the inherent increase in code size, which can lead to more I-cache capacity misses, and additional spill code if the working set of registers of the unrolled loop no longer fits into the processor's register file.
Since loop unrolling trades code size for speed, its effectiveness depends heavily on the unrolling factor, that is, the number of times the loop body is duplicated. As the factor grows, code size grows with it, which can cause front-end problems such as I-cache misses.
In fact, many compilers will happily (and silently) unroll loops for you, though not always. And unrolled loops are not always faster: they produce larger binaries and require more instruction decoding.
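For example, with GCC or Clang you can request unrolling globally or per loop and compare the generated code yourself. A minimal sketch, assuming GCC 8+ for the GCC pragma and Clang for the clang one (check the flag and pragma names against your compiler's documentation):

// Build with unrolling enabled for the whole file, e.g.:
//   g++ -O3 -funroll-loops sum.cpp
//
// Or hint a single loop with a pragma (the unroll factor 4 here is arbitrary):
float sum_array(const float* data, int n) {
    float sum = 0.0f;
    #pragma GCC unroll 4                    // GCC: duplicate the loop body up to 4 times
    // #pragma clang loop unroll_count(4)   // Clang equivalent
    for (int i = 0; i < n; i++) {
        sum += data[i];
    }
    return sum;
}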
Loop unrolling is a loop transformation that reduces the number of iterations by doing more work per iteration. It speeds up a program by eliminating most of the loop-control and loop-test instructions, at the cost of repeating the loop body.
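As a minimal illustration of what that means (a made-up copy loop, not code from the question), unrolling by four makes the index increment and the loop test run once per four elements instead of once per element:

#include <cstddef>

// Rolled: one compare, one increment, one branch per element copied.
void copy_rolled(int* dst, const int* src, std::size_t n) {
    for (std::size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}

// Unrolled by 4: the loop overhead is paid once per four elements;
// a small rolled loop handles the 0..3 leftover elements.
void copy_unrolled4(int* dst, const int* src, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i + 0] = src[i + 0];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
    for (; i < n; i++) {
        dst[i] = src[i];
    }
}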
Loop unrolling makes sense if you can break dependency chains. That gives an out-of-order, superscalar CPU the chance to schedule more work in parallel and thus run faster.
A simple example:
for (int i = 0; i < n; i++) {
    sum += data[i];
}
Here there is a single dependency chain running through sum: every addition needs the result of the previous one. If you get a stall because of a cache miss on the data array, the CPU cannot do anything but wait.
On the other hand, this code:
for (int i = 0; i < n - 3; i += 4) {  // note the n-3 bound so i+0..3 stay in range
    sum1 += data[i + 0];
    sum2 += data[i + 1];
    sum3 += data[i + 2];
    sum4 += data[i + 3];
}
sum = sum1 + sum2 + sum3 + sum4;
// if n % 4 != 0, handle the final 0..3 elements with a rolled-up loop or whatever
could run faster. If you get a cache miss or other stall in one calculation, there are still three other dependency chains that don't depend on the stalled one, and an out-of-order CPU can execute them in parallel.
(See "Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)" for an in-depth look at how register renaming helps CPUs find that parallelism, and at the details of an FP dot product on modern x86-64 CPUs, where pipelined SIMD FMA ALUs have high throughput but multi-cycle latency. Hiding the latency of FP addition or FMA is the major benefit of multiple accumulators: FP latencies are longer than integer ones, while SIMD throughput is often similar.)
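As a rough illustration of that technique (a hedged sketch, not code from the linked answer; the function name dot4 and the unroll factor of four are arbitrary choices here), a dot product written with four independent accumulators lets the pipelined FP units work on four chains at once instead of waiting out the latency of one long chain of additions:

#include <cstddef>

float dot4(const float* a, const float* b, std::size_t n) {
    // Four independent accumulators: each sumX depends only on its own
    // previous value, so the four FP addition chains can overlap in the
    // pipelined FP/FMA units instead of serializing on a single chain.
    float sum0 = 0.0f, sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        sum0 += a[i + 0] * b[i + 0];
        sum1 += a[i + 1] * b[i + 1];
        sum2 += a[i + 2] * b[i + 2];
        sum3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++) {   // leftover 0..3 elements
        sum0 += a[i] * b[i];
    }
    return (sum0 + sum1) + (sum2 + sum3);
}

Note that compilers generally won't make this transformation to floating-point code on their own, because reassociating FP additions changes rounding; that is why manual unrolling with multiple accumulators (or flags like -ffast-math) is needed to get this effect.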