Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is vectorization, faster in general, than loops?

Why, at the lowest level of the hardware performing operations and the general underlying operations involved (i.e.: things general to all programming languages' actual implementations when running code), is vectorization typically so dramatically faster than looping?

What does the computer do when looping that it doesn't do when using vectorization (I'm talking about the actual computations that the computer performs, not what the programmer writes), or what does it do differently?

I have been unable to convince myself why the difference should be so significant. I could probably be persuaded that vectorized code shaves off some looping overhead somewhere, but the computer still has to perform the same number of operations, doesn't it? For example, if we're multiplying a vector of size N by a scalar, we'll have N multiplications to perform either way, won't we?

like image 208
Ben Sandeen Avatar asked Jan 29 '16 18:01

Ben Sandeen


People also ask

Why vectorization is faster than for loops?

A major reason why vectorization is faster than its for loop counterpart is due to the underlying implementation of Numpy operations. As many of you know (if you're familiar with Python), Python is a dynamically typed language.

Why is vector operation faster than for loop MATLAB?

MATLAB is designed to perform vector operations really quickly. MATLAB is an interpreted language, which is why loops are so slow in it. MATLAB sidesteps this issue by providing extremely fast (usually written in C, and optimized for the specific architecture) and well tested functions to operate on vectors.

What is vectorization and how does it improves performance?

Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a set of values (vector) at one time. Modern CPUs provide direct support for vector operations where a single instruction is applied to multiple data (SIMD).

What does it mean to vectorize a loop?

Loop vectorization transforms procedural loops by assigning a processing unit to each pair of operands. Programs spend most of their time within such loops. Therefore, vectorization can significantly accelerate them, especially over large data sets.


1 Answers

Vectorization (as the term is normally used) refers to SIMD (single instruction, multiple data) operation.

That means, in essence, that one instruction carries out the same operation on a number of operands in parallel. For example, to multiply a vector of size N by a scalar, let's call M the number of operands that size that it can operate on simultaneously. If so, then the number of instructions it needs to execute is approximately N/M, where (with purely scalar operations) it would have to carry out N operations.

For example, Intel's current AVX 2 instruction set uses 256-bit registers. These can be used to hold (and operate on) a set of 4 operands of 64-bits apiece, or 8 operands of 32 bits apiece.

So, assuming you're dealing with 32-bit, single-precision real numbers, that means a single instruction can do 8 operations (multiplications, in your case) at once, so (at least in theory) you can finish N multiplications using only N/8 multiplication instructions. At least, in theory, this should allow the operation to finish about 8 times as fast as executing one instruction at a time would allow.

Of course, the exact benefit depends on how many operands you support per instruction. Intel's first attempts only supported 64-bit registers, so to operate on 8 items at once, those items could only be 8 bits apiece. They currently support 256-bit registers, and they've announced support for 512-bit (and they may have even shipped that in a few high-end processors, but not in normal consumer processors, at least yet). Making good use of this capability can also be non-trivial, to put it mildly. Scheduling instructions so you actually have N operands available and in the right places at the right times isn't necessarily an easy task (at all).

To put things in perspective, the (now ancient) Cray 1 gained a lot of its speed exactly this way. Its vector unit operated on sets of 64 registers of 64 bits apiece, so it could do 64 double-precision operations per clock cycle. On optimally vectorized code, it was much closer to the speed of a current CPU than you might expect based solely on its (much lower) clock speed. Taking full advantage of that wasn't always easy though (and still isn't).

Keep in mind, however, that vectorization is not the only way in which a CPU can carry out operations in parallel. There's also the possibility of instruction-level parallelism, which allows a single CPU (or the single core of a CPU) to execute more than one instruction at a time. Most modern CPUs include hardware to (theoretically) execute up to around 4 instructions per clock cycle1 if the instructions are a mix of loads, stores, and ALU. They can fairly routinely execute close to 2 instructions per clock on average, or more in well-tuned loops when memory isn't a bottleneck.

Then, of course, there's multi-threading--running multiple streams of instructions on (at least logically) separate processors/cores.

So, a modern CPU might have, say, 4 cores, each of which can execute 2 vector multiplies per clock, and each of those instructions can operate on 8 operands. So, at least in theory, it can be carrying out 4 * 2 * 8 = 64 operations per clock.

Some instructions have better or worse throughput. For example, FP adds throughput is lower than FMA or multiply on Intel before Skylake (1 vector per clock instead of 2). But boolean logic like AND or XOR has 3 vectors per clock throughput; it doesn't take many transistors to build an AND/XOR/OR execution unit, so CPUs replicate them. Bottlenecks on the total pipeline width (the front-end that decodes and issues into the out-of-order part of the core) are common when using high-throughput instructions, rather than bottlenecks on a specific execution unit.


  1. But, over time CPUs tend to have more resources available, so this number rises.
like image 173
Jerry Coffin Avatar answered Oct 08 '22 22:10

Jerry Coffin