I want to calculate the absolute values of the elements of a complex array in C or C++. The easiest way would be
for(int i = 0; i < N; i++)
{
b[i] = cabs(a[i]);
}
But for large vectors that will be slow. Is there a way to speed that up (by using parallelization, for example)? Language can be either C or C++.
Also, you can use std::future and std::async (they are part of C++11), maybe it's more clear way of achieving what you want to do:
#include <future>
...
int main()
{
...
// Create async calculations
std::future<void> *futures = new std::future<void>[N];
for (int i = 0; i < N; ++i)
{
futures[i] = std::async([&a, &b, i]
{
b[i] = std::sqrt(a[i]);
});
}
// Wait for calculation of all async procedures
for (int i = 0; i < N; ++i)
{
futures[i].get();
}
...
return 0;
}
IdeOne live code
We first create asynchronous procedures and then wait until everything is calculated.
Here I use sqrt instead of cabs because I just don't know what is cabs. I'm sure it doesn't matter.
Also, maybe you'll find this link useful: cplusplus.com
Or use Concurrency::parallele_for like that :
Concurrency::parallel_for(0, N, [&a, &b](int i)
{
b[i] = cabs(a[i]);
});
Use vector operations.
If you have glibc 2.22 (pretty recent), you can use the SIMD capabilities of OpenMP 4.0 to operate on vectors/arrays.
Libmvec is vector math library added in Glibc 2.22.
Vector math library was added to support SIMD constructs of OpenMP4.0 (#2.8 in http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf) by adding vector implementations of vector math functions.
Vector math functions are vector variants of corresponding scalar math operations implemented using SIMD ISA extensions (e.g. SSE or AVX for x86_64). They take packed vector arguments, perform the operation on each element of the packed vector argument, and return a packed vector result. Using vector math functions is faster than repeatedly calling the scalar math routines.
Also, see Parallel for vs omp simd: when to use each?
If you're running on Solaris, you can explicitly use vhypot() from the math vector library libmvec.so
to operate on a vector of complex numbers to obtain the absolute value of each:
Description
These functions evaluate the function hypot(x, y) for an entire vector of values at once. ...
The source code for libmvec
can be found at http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libmvec/ and the vhypot()
code specifically at http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libmvec/common/__vhypot.c I don't recall if Sun Microsystems ever provided a Linux version of libmvec.so
or not.
Using #pragma simd
(even with -Ofast
) or relying on the compilers auto-vectorization are more example of why it's a bad idea to blindly expect your compiler to implement SIMD efficiently. In order to use SIMD efficiently for this you need to use an array of struct of arrays. For example for single float with a SIMD width of 4 you could use
//struct of arrays of four complex numbers
struct c4 {
float x[4]; // real values of four complex numbers
float y[4]; // imaginary values of four complex numbers
};
Here is code showing how you could do this with SSE for the x86 instruction set.
#include <stdio.h>
#include <x86intrin.h>
#define N 10
struct c4{
float x[4];
float y[4];
};
static inline void cabs_soa4(struct c4 *a, float *b) {
__m128 x4 = _mm_loadu_ps(a->x);
__m128 y4 = _mm_loadu_ps(a->y);
__m128 b4 = _mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(x4,x4), _mm_mul_ps(y4,y4)));
_mm_storeu_ps(b, b4);
}
int main(void)
{
int n4 = ((N+3)&-4)/4; //choose next multiple of 4 and divide by 4
printf("%d\n", n4);
struct c4 a[n4]; //array of struct of arrays
for(int i=0; i<n4; i++) {
for(int j=0; j<4; j++) { a[i].x[j] = 1, a[i].y[j] = -1;}
}
float b[4*n4];
for(int i=0; i<n4; i++) {
cabs_soa4(&a[i], &b[4*i]);
}
for(int i = 0; i<N; i++) printf("%.2f ", b[i]); puts("");
}
It may help to unroll the loop a few times. In any case all this is moot for large N
because the operation is memory bandwidth bound. For large N (meaning when the memory usage is much larger than the last level cache), although #pragma omp parallel
may help some, the best solution is not to do this for large N. Instead do this in chunks which fit in the lowest level cache along with other compute operations. I mean something like this
for(int i = 0; i < nchunks; i++) {
for(int j = 0; j < chunk_size; j++) {
b[i*chunk_size+j] = cabs(a[i*chunk_size+j]);
}
foo(&b[i*chunck_size]); // foo is computationally intensive.
}
I did not implement an array of struct of array here but it should be easy to adjust the code for that.
If you are using a modern compiler (GCC 5, for example), you can use Cilk+, that will give you a nice array notation, automatically usage of SIMD instructions, and parallelisation.
So, if you want to run them in parallel you would do:
#include <cilk/cilk.h>
cilk_for(int i = 0; i < N; i++)
{
b[i] = cabs(a[i]);
}
or if you want to test SIMD:
#pragma simd
for(int i = 0; i < N; i++)
{
b[i] = cabs(a[i]);
}
But, the nicest part of Cilk is that you can just do:
b[:] = cabs(a[:])
In this case, the compiler and the runtime environment will decide to which level it should be SIMDed and what should be paralellised (the optimal way is applying SIMD on large-ish chunks in parallel). Since this is decided by a work scheduler at runtime, Intel claims it is capable of providing a near optimal scheduling, and that it should be able to make an optimal use of the cache.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With