I am pretty amazed that traversing two sorted std::lists takes around 12% more time than merging them. Merging can be viewed, and implemented, as a sequence of element comparisons, list splices, and iterator traversal through the two sorted linked lists. Hence traversing should not be slower than merging, especially when the two lists are large, because the share of work that is pure iteration grows with list size.
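To make that concrete, here is a minimal sketch of how such a merge could be implemented in terms of comparisons and splices (an illustration only, not the actual libstdc++ implementation):

#include <list>

// Sketch: walk the target list with an iterator, comparing against the
// front of `other`, and splice nodes across -- no element is copied,
// only node pointers are relinked.
void merge_sketch(std::list<int>& self, std::list<int>& other) {
    auto it = self.begin();
    while (it != self.end() && !other.empty()) {
        if (other.front() < *it)
            self.splice(it, other, other.begin()); // relink one node before it
        else
            ++it;
    }
    self.splice(self.end(), other); // append the remaining tail
}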
However, the result does not match what I expected. This is how I tested the idea:
#include <chrono>   // wall-clock timing
#include <cstdlib>  // rand
#include <iomanip>  // std::setprecision
#include <iostream>
#include <list>

int main() {
    std::list<int> list1, list2;
    for (int cnt = 0; cnt < 1 << 22; cnt++)
        list1.push_back(rand());
    for (int cnt = 0; cnt < 1 << 23; cnt++)
        list2.push_back(rand());
    list1.sort();
    list2.sort();

    auto start = std::chrono::system_clock::now(); // C++ wall clock

    // Choose either one option below
    list1.merge(list2);     // Option 1
    for (auto num : list1); // Option 2
    for (auto num : list2); // Option 2

    std::chrono::duration<double> diff = std::chrono::system_clock::now() - start;
    std::cout << std::setprecision(9) << "\n "
              << diff.count() << " seconds (measured)" << std::endl; // show elapsed time
}
PS: icc is smart enough to eliminate Option 2. Try sum += num; and print out sum.
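For instance, a variant of Option 2 that the optimizer cannot discard:

long long sum = 0;
for (auto num : list1) sum += num; // Option 2, kept alive
for (auto num : list2) sum += num;
std::cout << sum << std::endl;     // observable side effect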
This is the output from perf (the measured time remains the same without using perf):
Option 1: Merge
0.904575206 seconds (measured)
Performance counter stats for './option-1-merge':
33,395,981,671 cpu-cycles
149,371,004 cache-misses # 49.807 % of all cache refs
299,898,436 cache-references
24,254,303,068 cycle-activity.stalls-ldm-pending
7.678166480 seconds time elapsed
Option 2: Traverse
1.01401903 seconds (measured)
Performance counter stats for './option-2-traverse':
33,844,645,296 cpu-cycles
138,723,898 cache-misses # 48.714 % of all cache refs
284,770,796 cache-references
25,141,751,107 cycle-activity.stalls-ldm-pending
7.806018949 seconds time elapsed
Due to the horrible spatial locality of these linked lists, cache misses are the major reason the CPU stalls and are what consume most of the CPU's time. The strange point is that Option 2 has fewer cache misses than Option 1, yet it needs more CPU stalls and more CPU cycles to accomplish its task. What causes this anomaly?
As you know, it is memory that is taking all your time.
Cache misses are bad, but so are stalls.
From this paper:
Applications with irregular memory access patterns, e.g., dereferencing chains of pointers when traversing linked lists or trees, may not generate enough concurrently outstanding requests to fully utilize the data paths. Nevertheless, such applications are clearly limited by the performance of memory accesses as well. Therefore, considering the bandwidth utilization is not sufficient to detect all memory related performance issues.
Basically, randomly walking pointers can fail to saturate the memory bandwidth.
The tight loop over each list is blocked on every iteration, waiting for the next pointer to be loaded. If it is not in cache, the CPU can do nothing -- it stalls.
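The dependency is easy to see in a hand-rolled traversal (a sketch with a hypothetical Node layout):

struct Node { int value; Node* next; }; // hypothetical node layout

long long traverse(const Node* head) {
    long long sum = 0;
    // The address of the next load is p->next, which is itself the
    // result of the previous load: at most one miss is in flight.
    for (const Node* p = head; p != nullptr; p = p->next)
        sum += p->value;
    return sum;
}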
The combined tight loop of the merge tries to load two pages into the cache. When one is loading, sometimes the CPU can advance on the other.
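Sketched with the same hypothetical Node as above, a merge-style walk keeps two independent chains live at once:

long long merge_walk(const Node* a, const Node* b) {
    long long sum = 0;
    // Two pointer chains are live: while a miss on one list is being
    // serviced, the other list's next node may already be in cache.
    while (a && b) {
        if (a->value <= b->value) { sum += a->value; a = a->next; }
        else                      { sum += b->value; b = b->next; }
    }
    for (; a; a = a->next) sum += a->value; // drain the leftovers
    for (; b; b = b->next) sum += b->value;
    return sum;
}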
The result you measured is that the merge has fewer stalls than the naked, wasted double iteration.
Or in other words,
24,254,303,068 cycle-activity.stalls-ldm-pending
is a big number, but smaller than:
25,141,751,107 cycle-activity.stalls-ldm-pending
I am surprised this is enough to make a 10% difference, but that is what perf is for: measuring.