During experimenting with std::set in C++ and set() in Python I faced performance issue that I can't explain. Set intersection in C++ at least 3 times slower than in Python.
So could anybody point me at optimization that could be done to C++ code and/or explain how Python do this so much faster?
I expect that both of them use similar algorithm with O(n) complexity while set is ordered. But probably Python do some optimizations so it reach smaller coefficient.
set_bench.cc
#include <iostream>
#include <set>
#include <algorithm>
#include <iterator>
#include <chrono>
#include <functional>
#include <thread>
void elapsed(std::function<void()> f, const std::string& s)
{
auto start = std::chrono::steady_clock::now();
f();
std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
std::cout << s << " " << elapsed.count() << " seconds" << std::endl;
}
template <typename T>
void fill_set(std::set<T>& s, T start, T end, T step)
{
for (T i = start; i < end; i += step) {
s.emplace(i);
}
}
template <typename T>
void intersect(const std::set<T>& s1, const std::set<T>& s2, std::set<T>& result)
{
std::set_intersection(s1.begin(), s1.end(),
s2.begin(), s2.end(),
std::inserter(result, result.begin()));
}
int main()
{
std::set<int64_t> s1;
std::set<int64_t> s2;
std::set<int64_t> s3;
elapsed(std::bind(fill_set<int64_t>, std::ref(s1), 8, 1000*1000*100, 13), "fill s1 took");
elapsed(std::bind(fill_set<int64_t>, std::ref(s2), 0, 1000*1000*100, 7), "fill s2 took");
std::cout << "s1 length = " << s1.size() << ", s2 length = " << s2.size() << std::endl;
elapsed(std::bind(intersect<int64_t>, std::ref(s1), std::ref(s2), std::ref(s3)), "intersect s1 and s2 took");
std::cout << "s3 length = " << s3.size() << std::endl;
// sleep to let check memory consumption
// while (true) std::this_thread::sleep_for(std::chrono::milliseconds(1000));
}
set_bench.py
#!/usr/bin/env python3
import time
def elapsed(f, s):
start = time.monotonic()
f()
elapsed = time.monotonic() - start
print(f'{s} {elapsed} seconds')
def fill_set(s, start, end, step=1):
for i in range(start, end, step):
s.add(i)
def intersect(s1, s2, result):
result.update(s1 & s2)
s1 = set()
s2 = set()
elapsed(lambda : fill_set(s1, 8, 1000*1000*100, 13), 'fill s1 took')
elapsed(lambda : fill_set(s2, 0, 1000*1000*100, 7), 'fill s2 took')
print(f's1 length = {len(s1)}, s2 length = {len(s2)}')
s3 = set()
elapsed(lambda: intersect(s1, s2, s3), 'intersect s1 and s2 took')
print(f's3 length = {len(s3)}')
# sleep to let check memory consumption
# while True: time.sleep(1)
Here is results of running this programs in next environment:
$ clang -lstdc++ -O0 set_bench.cc -o set_bench && ./set_bench
fill s1 took 5.38646 seconds
fill s2 took 10.5762 seconds
s1 length = 7692308, s2 length = 14285715
intersect s1 and s2 took 1.48387 seconds
s3 length = 1098901
$ clang -lstdc++ -O1 set_bench.cc -o set_bench && ./set_bench
fill s1 took 3.31435 seconds
fill s2 took 6.41415 seconds
s1 length = 7692308, s2 length = 14285715
intersect s1 and s2 took 1.01276 seconds
s3 length = 1098901
$ clang -lstdc++ -O2 set_bench.cc -o set_bench && ./set_bench
fill s1 took 1.90269 seconds
fill s2 took 3.85651 seconds
s1 length = 7692308, s2 length = 14285715
intersect s1 and s2 took 0.512727 seconds
s3 length = 1098901
$ clang -lstdc++ -O3 set_bench.cc -o set_bench && ./set_bench
fill s1 took 1.92473 seconds
fill s2 took 3.72621 seconds
s1 length = 7692308, s2 length = 14285715
intersect s1 and s2 took 0.523683 seconds
s3 length = 1098901
$ gcc -lstdc++ -O3 set_bench.cc -o set_bench && time ./set_bench
fill s1 took 1.72481 seconds
fill s2 took 3.3846 seconds
s1 length = 7692308, s2 length = 14285715
intersect s1 and s2 took 0.516702 seconds
s3 length = 1098901
$ python3.7 ./set_bench.py
fill s1 took 0.9404696229612455 seconds
fill s2 took 1.082577683031559 seconds
s1 length = 7692308, s2 length = 14285715
intersect s1 and s2 took 0.17995300807524472 seconds
s3 length = 1098901
As you can see results are equal so I assume both programs do the same calculations.
By the way - RSS for C++ program is 1084896 kB and for Python - 1590400 kB.
There are two questions in this post:
Q: How to improve
std::set_intersection
performance in C++?
Use sorted std::vector
s instead of sets, that's much more cache-friendly. Since the intersection is done sequentially in a single pass, it will be as fast as it can get.
On my system I got a 0.04 s run time.
Stop here if this is all you needed.
Q: ... how [does] Python do this so much faster?
Or in other words "why is Python's set faster than the C++ set?". I'll focus on this question for the rest of my post.
First of all, a Python's set
is a hash table and std::set
is a binary tree. So use std::unordered_set
to compare apples to apples (we reject a binary tree at this point based on its O(logN) lookup complexity).
Note also that std::set_intersection
is simply a two-pointer algorithm; it iterates over two sorted sets, keeping only matching values.
Apart from it's name, there is nothing in common with Python's set_intersection
, which by itself is just a simple loop:
So we can't use std::set_intersection
on unsorted data, and need to implement the loop:
for (auto& v : set1) {
if (set2.find(v) != set2.end()) {
result.insert(v);
}
}
Nothing fancy here. Unfortunately though a straightforward application of this algorithm on std::unordered_set
is still slower by a factor of 3. How can that be?
We observe that the input data set is > 100MB in size. That won't fit in 8MB the cache of an i7-7700, which means the more work you can fit within the boundaries of 8MB, the faster your program will perform.
Python uses a special form of "dense hash table" similar to that of PHP hash table (generally a class of open addressing hash tables), whereas the C++ std::unordered_set
is typically a naïve, or vector-of-lists, hash table. A dense structure is much more cache-friendly, and thus faster. For implementation details see dictobject.c and setobject.c.
The built-in C++ std::hash<long>
is too complex for the already unique input data set you're generating. Python on the other hand uses an identity (no-op) hashing function for integers up to 230 (see long_hash
). Collisions are amortised by the LCG built into their hashtable implementation. You can't match that with C++ standard library features; an identity hash here will unfortunately again result in a too sparse hashtable.
Python uses a custom memory allocator pymalloc, which is similar to jemalloc and optimized for data locality. It generally outperform the built-in Linux tcmalloc, which is what a C++ program would normally use.
With that knowledge we can contrive a similarly performing C++ version, to demonstrate the technical feasibility:
#include <iostream>
#include <unordered_set>
#include <algorithm>
#include <iterator>
#include <chrono>
#include <functional>
#include <thread>
#include <tuple>
#include <string>
using namespace std::chrono_literals;
void elapsed(std::function<void()> f, const std::string& s)
{
auto start = std::chrono::steady_clock::now();
f();
auto end = std::chrono::steady_clock::now();
std::cout << s << " " << (end - start) / 1.0s << " seconds" << std::endl;
}
template <typename T>
struct myhash {
size_t operator()(T x) const {
return x / 5; // cheating to improve data locality
}
};
template <typename T>
using myset = std::unordered_set<T, myhash<T>>;
template <typename T>
void fill_set(myset<T>& s, T start, T end, T step)
{
s.reserve((end - start) / step + 1);
for (T i = start; i < end; i += step) {
s.emplace(i);
}
}
template <typename T>
void intersect(const myset<T>& s1, const myset<T>& s2, myset<T>& result)
{
result.reserve(s1.size() / 4); // cheating to compete with a better memory allocator
for (auto& v : s1)
{
if (s2.find(v) != s2.end())
result.insert(v);
}
}
int main()
{
myset<int64_t> s1;
myset<int64_t> s2;
myset<int64_t> s3;
elapsed(std::bind(fill_set<int64_t>, std::ref(s1), 8, 1000 * 1000 * 100, 13), "fill s1 took");
elapsed(std::bind(fill_set<int64_t>, std::ref(s2), 0, 1000 * 1000 * 100, 7), "fill s2 took");
std::cout << "s1 length = " << s1.size() << ", s2 length = " << s2.size() << std::endl;
elapsed(std::bind(intersect<int64_t>, std::ref(s1), std::ref(s2), std::ref(s3)), "intersect s1 and s2 took");
std::cout << "s3 length = " << s3.size() << std::endl;
}
With this code I got 0.28 s run times in both C++ and Python versions.
Now if we want to beat Python's set performance, we can remove all cheats and use Google's dense_hash_set
, which implements open addressing with quadratic probing, as a drop-in replacement (just need to call set_empty_object(0)
).
With google::dense_hash_set
and a no-op hashing function, we get:
fill s1 took 0.321397 seconds
fill s2 took 0.529518 seconds
s1 length = 7692308, s2 length = 14285714
intersect s1 and s2 took 0.0974416 seconds
s3 length = 1098901
Or 2.8 times faster than Python, while keeping the hash set functionality!
P.S. One would think - why does the C++ standard library implement such a slow hash table? The no-free-lunch theorem also applies here: a probing-based solution is not always fast; being an opportunistic solution, it sometimes suffers from "clumping" (endlessly probing into occupied space). And when that happens, performance drop exponentially. The idea behind the standard library implementation was to guarantee predictable performance for all possible inputs. Unfortunately though the caching effect on modern hardware is too great to be neglected, as Chandler Carruth explains in his talk.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With