I have a single line of code that consumes 25-30% of the runtime of my application. It is a less-than comparator for an std::set (the set is implemented as a red-black tree). It is called about 180 million times within 28 seconds.
struct Entry {
    const float _cost;
    const long _id;
    // some other vars
    Entry(float cost, long id) : _cost(cost), _id(id) {
    }
};
template<class T>
struct lt_entry : public binary_function<T, T, bool>
{
    bool operator()(const T &l, const T &r) const
    {
        // Most readable shape
        if (l._cost != r._cost) {
            return r._cost < l._cost;
        } else {
            return l._id < r._id;
        }
    }
};
The entries should be sorted by cost and, if the cost is the same, by their id. I have many insertions for each extraction of the minimum. I thought about using Fibonacci heaps, but I have been told that they are theoretically nice but suffer from high constants and are pretty complicated to implement. And since insert is in O(log(n)), the runtime increase is nearly constant for large n. So I think it's okay to stick with the set.
To improve performance I tried to express it in different shapes:
return l._cost != r._cost ? r._cost < l._cost : l._id < r._id;
return r._cost < l._cost || (l._cost == r._cost && l._id < r._id);
Even this:
typedef union {
    float _f;
    int _i;
} flint;
//...
flint diff;
diff._f = (l._cost - r._cost);
// A positive difference means l._cost > r._cost, which is "less" in this
// ordering; assumes no NaNs (type-punning via a union is a GCC extension).
return diff._i ? (diff._i > 0) : (l._id < r._id);
But the compiler seems to be smart enough already, because I haven't been able to improve the runtime.
I also thought about SSE but this problem is really not very applicable for SSE...
The assembly looks somewhat like this:
movss (%rbx),%xmm1
mov $0x1,%r8d
movss 0x20(%rdx),%xmm0
ucomiss %xmm1,%xmm0
ja 0x410600 <_ZNSt8_Rb_tree[..]+96>
ucomiss %xmm0,%xmm1
jp 0x4105fd <_ZNSt8_Rb_[..]_+93>
jne 0x4105fd <_ZNSt8_Rb_[..]_+93>
mov 0x28(%rdx),%rax
cmp %rax,0x8(%rbx)
jb 0x410600 <_ZNSt8_Rb_[..]_+96>
xor %r8d,%r8d
I have only a tiny bit of experience with assembly language, not really much.
I thought it would be the best (only?) point to squeeze out some performance, but is it really worth the effort? Can you see any shortcuts that could save some cycles?
The platform the code will run on is Ubuntu 12 with gcc 4.6 (-std=c++0x) on a many-core Intel machine. The only libraries available are boost, OpenMP and TBB. The 30-second benchmark was performed on my 4-year-old laptop (Core 2 Duo).
I am really stuck on this one; it seems so simple, yet takes that much time. I have been racking my brain for days thinking about how I could improve this line...
Can you give me a suggestion how to improve this part, or is it already at its best?
EDIT 1: After using Jerry's suggestion I achieved a speed-up of ~4.5 seconds.
EDIT 2: After trying boost Fibonacci heaps, the comparison count went down to 174 million calls to the less-than function.
An easy solution is to precompute a sort identifier composed of the cost as the most significant part and the id as the rest.
E.g.,
struct Entry
{
    double cost_;
    long id_;
    long long sortingId_;
    // some other vars
    Entry( double cost, long id )
        : cost_( cost ), id_( id ), sortingId_( 1e9*100*cost + id )
    {}
};
Adjust the sortingId_ computation based on what you can assume about the value ranges. Then just sort on sortingId_.
Or, as a variation of the same idea, if you can't make suitable assumptions about the data, consider preparing the data especially for memcmp.
For a higher-level solution, remember that std::set::insert has an overload with a hint argument. If your data is already in nearly sorted order, that might seriously reduce the number of calls to your comparator function.
And you might consider whether a std::unordered_set would be sufficient, i.e. whether you actually need to list the data in sorted order, or whether the sorting is just internal to std::set element insertion.
Finally, for other readers (the OP has made clear that he's aware of this), remember to MEASURE.