Instruction-Level-Parallelism Exploration

I am just wondering if there are any useful tools out there that allow me to exploit the instruction-level parallelism in some algorithms. More specifically, I have a subset of algorithms from the multimedia domain, and I wonder what the best way is to exploit ILP in these algorithms. All of these algorithms are implemented in C, so ideally I would give them as input to some tool, and it would tell me which instructions could be executed in parallel.

Many thanks for any pointers!

Robert

What is instruction level parallelism?

Instruction-level parallelism (ILP) is the parallel or simultaneous execution of a sequence of instructions in a computer program. More specifically ILP refers to the average number of instructions run per step of this parallel execution.

What is the difference between ILP and TLP?

Unlike instruction level parallelism, which exploits implicit parallel operations within a loop or straight-line code segment, thread level parallelism is explicitly represented by the use of multiple threads of execution that are inherently parallel.

What is instruction level parallelism in microprocessors?

Instruction-level Parallelism (ILP) is a family of processor and compiler design techniques that speed up execution by causing individual machine operations, such as memory loads and stores, integer additions and floating point multiplications, to execute in parallel.

The problem is that deciding whether an instruction will be executed in parallel is quite difficult considering how many different processor types there are. A good understanding of the CPU architecture you are targeting will give you a good starting point for doing this sort of work. No software will beat a human mind with the right knowledge.

In general, though, so much of this work is done by the compiler and by hardware such as out-of-order execution engines that it is abstracted away from you as much as possible. Even if you understand all of this fully, you are unlikely to get more than a few percent of speed improvement.

If you want to see serious speed improvements you are far better off re-writing the algorithm to take advantage of multiple processors and available SIMD operations. You can see serious speed improvements using SIMD alone and this is especially so for a lot of "multimedia algorithms" that can process multiple elements of the data simultaneously.
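As a rough sketch of what "processing multiple elements simultaneously" can look like in plain C, here is a loop written so that a compiler has a fair chance of vectorizing it. The function name `average_pixels` and the pixel-averaging task are made up for illustration, and whether SIMD instructions are actually emitted depends on your compiler, flags and target, so check the generated assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: average two 8-bit images pixel by pixel,
   rounding halves up. Every iteration is independent of the others,
   and `restrict` promises the compiler that the three buffers do not
   overlap, which is what allows several pixels to be handled per
   instruction. */
void average_pixels(uint8_t *restrict out,
                    const uint8_t *restrict a,
                    const uint8_t *restrict b,
                    size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* a[i] + b[i] is computed as int, so it cannot overflow. */
        out[i] = (uint8_t)((a[i] + b[i] + 1) / 2);
    }
}
```

If the compiler does not vectorize a loop like this on its own, the usual next step is platform-specific intrinsics, which are left out here because they tie the code to one instruction set.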

First, both the compiler and the CPU itself already aggressively reorder instructions to exploit ILP as well as possible. Most likely, they're doing a better job of it than you'd ever be able to.

However, there are a few areas where a human can aid the process.

The compiler is typically very conservative about reordering floating-point computations, because doing so might slightly change the result. For example, take this code:

float f, g, h, i;
float j = f + g + h + i;

you'll likely get zero ILP because the code you've written is evaluated as ((f + g) + h) + i: the result of the first addition is used as an operand for the next, the result of which is used as an operand in the final addition. No two additions can execute in parallel.

If you instead write it as float j = (f + g) + (h + i), the CPU is able to execute f+g and h+i in parallel, because they don't depend on each other.
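To make the two groupings concrete, here they are as a pair of small functions. The names `sum_chain` and `sum_pairs` are only for illustration. Both compute the same mathematical sum, but in floating point the results can differ in the last bits (or by more, when the inputs have very different magnitudes), which is the reason the compiler will not regroup for you unless you allow it with relaxed floating-point options.

```c
/* Source order: three additions, each one waiting on the result of
   the previous one. */
float sum_chain(float f, float g, float h, float i)
{
    return ((f + g) + h) + i;
}

/* Regrouped: (f + g) and (h + i) have no dependency on each other,
   so the CPU is free to run them at the same time. Only the final
   addition has to wait for both. */
float sum_pairs(float f, float g, float h, float i)
{
    return (f + g) + (h + i);
}
```

For four values the gain is tiny; the same regrouping matters much more inside a hot loop.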

In general, the thing preventing ILP is dependencies. Sometimes they're direct dependencies between arithmetic instructions as above, and sometimes they're store/load dependencies.

Loads and stores take a long time to execute compared to in-register operations, and operations that depend on them have to wait until the load/store operation has finished.

So storing data in temporaries which the compiler can keep in registers can sometimes be used to avoid memory accesses. Likewise, starting loads as early as possible helps too, to keep their latency from blocking the operations that follow.
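A minimal sketch of the temporary trick, with made-up function names. In the first version the compiler generally has to assume that `total` might point into `data`, so it tends to reload and store `*total` on every iteration. The second version accumulates in a local variable, which can live in a register, and touches memory through `total` only once.

```c
#include <stddef.h>

/* Accumulates through the pointer. Because `total` could alias an
   element of `data`, each iteration usually involves a load and a
   store of *total. */
void sum_through_pointer(const int *data, size_t n, int *total)
{
    *total = 0;
    for (size_t i = 0; i < n; ++i) {
        *total += data[i];
    }
}

/* Accumulates in a local temporary. The running sum can stay in a
   register, and the result is written back a single time. */
void sum_with_temporary(const int *data, size_t n, int *total)
{
    int acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += data[i];
    }
    *total = acc;
}
```

Both give the same answer as long as `total` really does not point into `data`; the difference is only in how much memory traffic sits on the critical path.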

The best technique is really to look closely at your code, and work out the dependency chains. Each sequence of operations where each one depends on the result of the previous is a chain of dependencies that can never be executed in parallel. Can this chain be broken up in some way? Perhaps by storing a value in a temporary, or perhaps by recomputing a value instead of waiting for the cached version to be loaded from memory. Perhaps just by placing a few parentheses as in the original floating-point example.
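One common way to break such a chain in a reduction loop is to use several independent accumulators. The sketch below uses a dot product as the example; the function names and the choice of four accumulators are assumptions for illustration, and the best number depends on the latency and throughput of the target CPU. As with the parentheses example, the regrouped version may round slightly differently from the single-accumulator one.

```c
#include <stddef.h>

/* One accumulator: every addition depends on the previous one, so
   the loop forms a single long dependency chain. */
float dot_single(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Four accumulators: four shorter chains that do not depend on each
   other and can be in flight at the same time. */
float dot_four(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    /* Handle the leftover elements when n is not a multiple of 4. */
    for (; i < n; ++i) {
        s0 += a[i] * b[i];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Measure both versions before keeping the longer one; on short arrays the extra bookkeeping can cancel out the gain.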

When there are no dependencies, the CPU will schedule operations to execute in parallel. So all you need to do to exploit ILP is to break up long dependency chains.

Of course, that's easier said than done... :)

But if you spend some time with a profiler, and study the assembly output from the compiler, you can sometimes get an impressive speedup from manually optimizing your code to better exploit ILP.

Tags: c++, c, parallel-processing

2 Answers: Goz, jalf
