Why is traversing more time-consuming than merging on two sorted std::list?

I am pretty amazed by the result that traversing two sorted std::lists takes around 12% more time than merging them. Merging can be viewed, and implemented, as a series of element comparisons, list splices, and iterator traversal over two separate sorted linked lists. Hence, traversal alone should be no slower than merging, especially when the two lists are large.

However, the result seems to not match what I thought, and this is how I test my ideas above:

#include <chrono>
#include <cstdlib>   // rand
#include <iomanip>
#include <iostream>
#include <list>

int main() {
    std::list<int> list1, list2;

    for (int cnt = 0; cnt < 1 << 22; cnt++)
        list1.push_back(rand());
    for (int cnt = 0; cnt < 1 << 23; cnt++)
        list2.push_back(rand());

    list1.sort();
    list2.sort();

    auto start = std::chrono::system_clock::now();  // C++ wall clock

    // Choose either one option below
    list1.merge(list2);         // Option 1
    for (auto num : list1);     // Option 2
    for (auto num : list2);     // Option 2

    std::chrono::duration<double> diff = std::chrono::system_clock::now() - start;
    std::cout << std::setprecision(9) << "\n       "
              << diff.count() << " seconds (measured)" << std::endl;  // show elapsed time
}

PS: icc is smart enough to eliminate Option 2 entirely. To prevent that, accumulate sum += num; in the loop and print out sum.

This is the output from perf: (the measured time remains the same without using perf)

Option 1: Merge

       0.904575206 seconds (measured)

 Performance counter stats for './option-1-merge':

    33,395,981,671      cpu-cycles
       149,371,004      cache-misses              #   49.807 % of all cache refs
       299,898,436      cache-references
    24,254,303,068      cycle-activity.stalls-ldm-pending    

       7.678166480 seconds time elapsed

Option 2: Traverse

       1.01401903 seconds (measured)

 Performance counter stats for './option-2-traverse':

    33,844,645,296      cpu-cycles
       138,723,898      cache-misses             #   48.714 % of all cache refs
       284,770,796      cache-references
    25,141,751,107      cycle-activity.stalls-ldm-pending

       7.806018949 seconds time elapsed

Because of the horrible spatial locality of these linked lists, cache misses are the major cause of CPU stalls and occupy most of the CPU's time. The strange point is that option 2 has fewer cache misses than option 1, yet it needs more CPU stalls and more CPU cycles to accomplish its task. What makes this abnormality happen?

Asked Jul 08 '17 by Kevin Dong

1 Answer

As you know, it is memory that is taking all your time.

Cache misses are bad, but so are stalls.

From this paper:

Applications with irregular memory access patterns, e.g., dereferencing chains of pointers when traversing linked lists or trees, may not generate enough concurrently outstanding requests to fully utilize the data paths. Nevertheless, such applications are clearly limited by the performance of memory accesses as well. Therefore, considering the bandwidth utilization is not sufficient to detect all memory related performance issues.

Basically, randomly walking pointers can fail to saturate the memory bandwidth.

The tight loop over a single list is blocked on every iteration, waiting for the next pointer to be loaded. If it is not in cache, the CPU can do nothing -- it stalls.

The combined tight loop in merge tries to pull nodes of two lists into the cache. While a load from one list is pending, the CPU can sometimes advance on the other.

The result you measured is that the merge has fewer stalls than the naked, wasted double iteration.

Or in other words,

24,254,303,068      cycle-activity.stalls-ldm-pending 

is a big number, but still smaller than:

25,141,751,107      cycle-activity.stalls-ldm-pending

I am surprised this is enough to make a 10% difference, but that is exactly what perf is for: measuring instead of guessing.

Answered Nov 18 '22 by Yakk - Adam Nevraumont