I am processing point cloud data (150k points per cloud). I would like, for each (x,y) point, to compute the distance to a reference point O, and azimuth:
for each point p in points
dx = p.x - ox
dy = p.y - oy
d = hypot(dx, dy)
az = atan2(dy, dx)
I have a manual SSE implementation. I was hoping to make the code clearer using eigen:
ArrayXf x(points.size()), y(points.size());
for(unsigned i=0; i<points.size(); ++i) {
x[i] = points[i].x;
y[i] = points[i].y;
}
const ArrayXf d = (dx.square() + dy.square()).sqrt();
// implement a polynomial approximation to atan (same as the SSE)
However, from my timing experiments, this does not seem to vectorize at all as the time is the same as with the baseline implementation. And I know that SSE2 is enabled since I'm compiling some SSE2 code in the same file.
However, according to the doc, Eigen does take advantage of SSE2 when supported (and AVX in 3.3). Is it only for vectors and matrices operations?
EDIT: I studied the produced assembly code and it does contain some SSE instructions. But it's still slow
EDIT: here is more timing information. I'm looping over 100 frames, about 150k points per frame.
here is my eigen code:
const Eigen::Map<const Eigen::ArrayXf, Eigen::Unaligned, Eigen::InnerStride<4> > px(&(points[0].x), points.size());
const Eigen::Map<const Eigen::ArrayXf, Eigen::Unaligned, Eigen::InnerStride<4> > py(&(points[0].y), points.size());
// difference with the origin (ox and oy are floats)
const Eigen::ArrayXf dx = px - ox, dy = py - oy;
// distance and index
const Eigen::ArrayXf d = sqrt(dx.square() + dy.square());
static const float r_res_mult = 1.0f / r_res; //2x faster than div
const Eigen::ArrayXi didx = (d * r_res_mult).cast<int>();
Your main problem is that your data is not in a format friendly for SIMD. You're using an array of structs (xyxyxyxyxyxy...) and then to vectorize your code you do
for(unsigned i=0; i<points.size(); ++i) {
x[i] = points[i].x;
y[i] = points[i].y;
}
which converts to a struct of arrays (xxxxxxxx....yyyyyyy...). This conversion is expensive.
A better solution is to already have your points stored as a struct of arrays. An even better solution is to use a hybrid struct of arrays aka array of struct of array. For SSE, assuming you're using single floating point, you would then do xxxxyyyyxxxxyyyy....
Next I suggest you use a SIMD math library. Intel offers SVML which is expensive and closed source. AMD offers libm which is free but closed source. But these libraries both don't play well on their competitor's hardware. The best SIMD library is Agner Fog's Vector Class Library (VCL) . It's open source, free, and optimized to work on both Intel and AMD processors. It's also, like Eigen, just header files and therefore, like Eigen, you don't have to compile and link the library. You just included the header files. Here is how you would do it for SSE or AVX for float (the VLC will emulate AVX on a system without AVX).
// g++ -O3 -Ivectorclass -msse4.2 foo.cpp
// or g++ -O3 -Ivectorclass -mavx foo.cpp
#include <vectorclass.h>
#include <vectormath_trig.h>
struct Point2DBlock {
float x[8];
float y[8];
};
int main(void) {
const int nblocks = 10; //each block contains eight points
Point2DBlock aosoa[nblocks]; //xxxxxxxxyyyyyyyy xxxxxxxxyyyyyyyy ...
float ox = 0.0f, oy = 0.0f;
Vec8f vox = ox, voy = oy;
for(int i=0; i<nblocks; i++) {
Vec8f dx = Vec8f().load(aosoa[i].x) - vox;
Vec8f dy = Vec8f().load(aosoa[i].y) - voy;
Vec8f d = sqrt(dx*dx + dy*dy);
Vec8f az = atan2(dy,dx);
}
}
If you really need hypot
. You can build one from the VCL using the pseudo-code from wikipedia.
static inline Vec8f hypot(Vec8f const &x, Vec8f const &y) {
Vec8f t;
Vec8f ax = abs(x), ay = abs(y);
t = min(ax,ay);
ax = max(ax,ay);
t = t/ax;
return ax*sqrt(1+t*t);
}
Edit:
Here is a method using an array of structs. This requires some shuffling but this may be negligible compared to the other calculations. The VLC uses template meta programming to determine an efficient method for shuffling.
#include <vectorclass.h>
#include <vectormath_trig.h>
int main(void) {
const int npoints=80;
float points[2*npoints]; //xyxyxyxyxyxy...
float ox = 0.0, oy = 0.0;
Vec8f vox = ox, voy = oy;
for(int i=0; i<npoints; i+=16) {
Vec8f l1 = Vec8f().load(&points[i+0]);
Vec8f l2 = Vec8f().load(&points[i+8]);
Vec8f dx = blend8f<0, 2, 4, 6, 8, 10, 12, 14>(l1,l2) - vox;
Vec8f dy = blend8f<1, 3, 5, 7, 9, 11, 13, 15>(l1,l2) - voy;
Vec8f d = sqrt(dx*dx + dy*dy);
Vec8f az = atan2(dy,dx);
}
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With