STL find performs better than hand-crafted loop

Question

I have some question. Given the following C++ code fragment:

#include <boost/progress.hpp>

#include <vector>
#include <algorithm>
#include <numeric>
#include <iostream>

struct incrementor
{
  incrementor() : curr_() {}

  unsigned int operator()()
  { return curr_++; }

private:
  unsigned int curr_;
};

template<class Vec>
char const* value_found(Vec const& v, typename Vec::const_iterator i)
{
  return i==v.end() ? "no" : "yes";
}


template<class Vec>
typename Vec::const_iterator find1(Vec const& v, typename Vec::value_type val)
{
  return find(v.begin(), v.end(), val);
}


template<class Vec>
typename Vec::const_iterator find2(Vec const& v, typename Vec::value_type val)
{
  for(typename Vec::const_iterator i=v.begin(), end=v.end(); i<end; ++i)
    if(*i==val) return i;
  return v.end();
}

int main()
{
  using namespace std;
  typedef vector<unsigned int>::const_iterator iter;
  vector<unsigned int> vec;
  vec.reserve(10000000);

  boost::progress_timer pt;

  generate_n(back_inserter(vec), vec.capacity(), incrementor());
  //added this line, to avoid any doubts, that compiler is able to
  // guess the data is sorted
  random_shuffle(vec.begin(), vec.end());

  cout << "value generation required: " << pt.elapsed() << endl;

  double d;
  pt.restart();
  iter found=find1(vec, vec.capacity());
  d=pt.elapsed();
  cout << "first search required: " << d << endl;
  cout << "first search found value: " << value_found(vec, found)<< endl;


  pt.restart();
  found=find2(vec, vec.capacity());
  d=pt.elapsed();
  cout << "second search required: " << d << endl;
  cout << "second search found value: " << value_found(vec, found)<< endl;


  return 0;
}

On my machine (Intel i7, Windows Vista) STL find (call via find1) runs about 10 times faster than the hand-crafted loop (call via find2). I first thought that Visual C++ performs some kind of vectorization (may be I am mistaken here), but as far as I can see assembly does not look the way it uses vectorization. Why is STL loop faster? Hand-crafted loop is identical to the loop from the STL-find body.

I was asked to post program's output. Without shuffle:

value generation required: 0.078
first search required: 0.008
first search found value: no
second search required: 0.098
second search found value: no

With shuffle (caching effects):

value generation required: 1.454
first search required: 0.009
first search found value: no
second search required: 0.044
second search found value: no

Many thanks,

dusha.

P.S. I return the iterator and write out the result (found or not), because I would like to prevent compiler optimization, that it thinks the loop is not required at all. The searched value is obviously not in the vector.

P.P.S. I was asked to post assembly generated for the find functions. Here it is:

found=find1(vec, vec.capacity());
001811D0  lea         eax,[esp+5Ch] 
001811D4  call        std::vector<unsigned int,std::allocator<unsigned int> >::capacity (1814D0h) 
001811D9  mov         esi,dword ptr [esp+60h] 
001811DD  mov         ecx,dword ptr [esp+64h] 
001811E1  cmp         esi,ecx 
001811E3  je          wmain+180h (1811F0h) 
001811E5  cmp         dword ptr [esi],eax 
001811E7  je          wmain+180h (1811F0h) 
001811E9  add         esi,4 
001811EC  cmp         esi,ecx 
001811EE  jne         wmain+175h (1811E5h) 



found=find2(vec, vec.capacity());
001812AE  lea         eax,[esp+5Ch] 
001812B2  call        std::vector<unsigned int,std::allocator<unsigned int> >::capacity (1814D0h) 
001812B7  mov         ecx,dword ptr [esp+60h] 
001812BB  mov         edx,dword ptr [esp+64h] 
001812BF  cmp         ecx,edx 
001812C1  je          wmain+262h (1812D2h) 
001812C3  cmp         dword ptr [ecx],eax 
001812C5  je          wmain+34Fh (1813BFh) 
001812CB  add         ecx,4 
001812CE  cmp         ecx,edx 
001812D0  jne         wmain+253h (1812C3h)

find2 uses ecx-register instead of esi. What is the difference between these two registers? Can it be that esi assumes the pointer to be properly aligned and therefore brings additional performance?

Read some assembly reference ecx is just a counter, whereas esi is the memory source. So I think STL algorithm knows that Random Access Iterator is properly aligned and therefore uses memory pointers. Where in the non-STL version there is no speculation how the alignment is. Am I right?

Billy ONeal · Accepted Answer

Visual C++'s find algorithm uses unchecked iterators, whereas your hand written loop is using checked iterators.

~~My other guess is that you're calling std::vector<t>::end() on every iteration of your loop in find2, while std::find results in only one call to the begin and end accessors.~~ I am an idiot.

villintehaspam · Answer

Make sure that you compile your code in release mode with checked iterators turned off

Set _SECURE_SCL=0 in your preprocessor definitions.

Also, boost::progress_timer has a resolution of milliseconds I believe (it is based on std::clock), this makes it very unreliable for accurate measurements of short durations. You need to make the code that you measure significantly slower so that you get rid of other factors (such as your process being on hold and so on). You should measure using the high performance counters as suggested by DeadMG.

John Bartholomew · Answer

Your measurement methodology is broken. Measuring the execution speed of code is very difficult to do correctly, because the total elapsed time depends on factors which may not be explicitly controlled by the code you've written.

Some things to check (you may consider some of these obvious):

What optimisation settings are you using? You are testing a release build, right?
You said you've checked the assembly code generated by the STL version and it isn't using vectorisation. But maybe it's using some other common optimisation method such as loop unrolling?
Why are you using i < end rather than i != end in your loop? (I actually doubt this makes any difference, but who knows?)

(My original answer was completely stupid -- not sure why it got a vote -- I'm leaving it here since some of the comments relate to it)

In this case, I suspect you're just seeing the effects of the memory hierarchy. When you call find1(), the CPU has to read all the data from RAM. That data will then be stored in the CPU's cache, which is much (easily 10 to 100 times) faster than accessing RAM. When you call find2(), the CPU can read the whole array from cache memory, and so find2() takes less time to execute.

To get some more evidence, try swapping the code so that you measure find2() first and then find1(). If your results get reversed, it's likely that you're seeing the effect of the cache. If they don't, then it's something else.

Edit After some further thought (actually, just after some thought), I think my original suspicion must be incorrect (the size of the array you're searching through makes it unlikely that the entire array has been put into the cache). There may still be cache effects, but they are probably somewhat more subtle. Still, try reversing the measurements anyway, it'd be interesting to see what effect it has.

Puppy · Answer

find doesn't take a value_type, it takes a const value_type&. Now, I would say that for unsigned int, this should make no difference. However, it's perfectly possible that your optimizer just didn't notice this, and failed to correctly optimize your loop body.

Edit: What I would suggest is that you're kind of lying to the compiler with your use of the for loop. You could rewrite it as

typename Vec::iterator i, end;
i = vec.begin();
end = vec.end();
while(i != end && *i != val)
    i++;
return i;

Of course, the guy who wrote std::find knows exactly how smart the optimizer is, and what exactly it can and can't cope with.

Edit: I ran your test on my machine. It's an i7 930, no overclock, on Visual Studio 2010. I replaced the boost::progress_timer with the High Performance Counter.

__int64 frequency, begin, end;
QueryPerformanceCounter(frequency);
double d;
QueryPerformanceCounter(begin);
iter found=find1(vec, vec.capacity());
QueryPerformanceCounter(end);
d = ((end - begin) / (double)frequency) * 1000000;
cout << "first search required: " << d << endl;
cout << "first search found value: " << value_found(vec, found)<< endl;


QueryPerformanceCounter(begin);
found=find2(vec, vec.capacity());
QueryPerformanceCounter(end);
d = ((end - begin) / (double)frequency) * 1000000;
cout << "second search required: " << d << endl;
cout << "second search found value: " << value_found(vec, found)<< endl;

Says that both of them took 0.24 (roughly) nanoseconds to operate- that is, there is no difference. My suggestion is that your optimizer is just immature and that your version of std::find is written exactly to present the correct optimizations, whereas your find just doesn't tick the correct optimization boxes.

Edit: Your timing data is clearly busted. My i7 is running in 0.23 nanoseconds- that is, 0.00000023 seconds, whereas yours wants 0.008 seconds. Unless my i7 is about 40,000 times faster than yours, there's no way. There is no way that an i7 will take so long to traverse only ten million items. Of course, I'm actually running 64bit Windows 7, although didn't compile in 64bit mode.

Going to post the disassembler now.

find1:

00F810D3  mov         esi,dword ptr [esp+34h]  
00F810D7  mov         eax,dword ptr [esp+3Ch]  
00F810DB  mov         ecx,dword ptr [esp+38h]  
00F810DF  sub         eax,esi  
00F810E1  sar         eax,2  
00F810E4  cmp         esi,ecx  
00F810E6  je          main+0B3h (0F810F3h)  
00F810E8  cmp         dword ptr [esi],eax  
00F810EA  je          main+0B3h (0F810F3h)  
00F810EC  add         esi,4  
00F810EF  cmp         esi,ecx  
00F810F1  jne         main+0A8h (0F810E8h)

find2:

00F8119A  mov         ecx,dword ptr [esp+34h]  
00F8119E  mov         eax,dword ptr [esp+3Ch]  
00F811A2  mov         edx,dword ptr [esp+38h]  
00F811A6  sub         eax,ecx  
00F811A8  sar         eax,2  
00F811AB  cmp         ecx,edx  
00F811AD  jae         main+17Fh (0F811BFh)  
00F811AF  nop  
00F811B0  cmp         dword ptr [ecx],eax  
00F811B2  je          main+254h (0F81294h)  
00F811B8  add         ecx,4  
00F811BB  cmp         ecx,edx  
00F811BD  jb          main+170h (0F811B0h)  
00F811BF  mov         esi,edx

You can see that find2 is slightly different to find1. I checked by replacing the call to find2 with another find1 call which does produce identical disassembly. Curious that they produce different assembly.

Johan Kotlinski · Answer

I don't use Visual C++ myself, but with GCC I also got the result that find2 is a little slower. However, I was able to make find2 just slightly faster than find1 by manually unrolling the loop:

template<class Vec>
typename Vec::const_iterator find2(Vec const& v, typename Vec::value_type val)
{
  for(typename Vec::const_iterator i=v.begin(), end=v.end(); i != end; ) {
    if (i[0]==val) return i;
    if (i[1]==val) return i + 1;
    i += 2;
  }
  return v.end();
}

My guess to why std::find is faster is that the compiler has all the information to figure out that the vector size is a multiple of 2, and that it is possible to do this unrolling.

Another guess is that it is just a tradeoff between space/size - that the compiler skips this optimization in the general case.

STL find performs better than hand-crafted loop

Tags:

c++

performance

find

assembly

stl

dusha

5 Answers

Billy ONeal

villintehaspam

John Bartholomew

Puppy

Johan Kotlinski

Recent Activity

Donate For Us

STL find performs better than hand-crafted loop

Tags:

c++

performance

find

assembly

stl

dusha

5 Answers

Billy ONeal

villintehaspam

John Bartholomew

Puppy

Johan Kotlinski

Related questions

Recent Activity

Donate For Us