I have a class A that has a std::vector<int> as an attribute. A needs to fill this vector when an instance of A is created. The calculation can take some time, and I would like to know if it can be done at compile time (both filling the vector and sorting it, as marked in the code below). I am not familiar with metaprogramming and haven't found a way to do it so far. It's not an OS-specific question.
Here is the A.cpp file:
#include "A.h"

#define SIZEV 100

A::A()
{
    fillVector();
}

void A::fillVector()
{
    // m_vector is an attribute of class "A"
    // EXPECTATION 1 : fill the vector with the following calculation at compile time
    const int a=5;
    const int b=7;
    const int c=9;
    for(int i=0;i<SIZEV;i++){
        for(int j=0;j<SIZEV;j++){
            for(int k=0;k<SIZEV;k++){
                this->m_vector.push_back(a*i+b*j+c*k);
            }
        }
    }
    // EXPECTATION 2 : sort the vector at compile time
}

std::vector<int> A::getVector() const
{
    return m_vector;
}

void A::setVector(const std::vector<int> &vector)
{
    m_vector = vector;
}
and the main.cpp (a Qt app, but that doesn't matter):
#include <QCoreApplication>
#include "A.h"

int main(int argc, char *argv[])
{
    QCoreApplication app(argc, argv);

    A a;
    auto vec = a.getVector();
    // use vec, etc ...

    return app.exec();
}
This is a simple compile-time sort for ints. It works by computing, for each element, its position within the list; from this it works out what should be at each position, and then builds a new list with each element inserted at its appropriate position. It's probably not as efficient in terms of complexity (it's O(n^2)) as other solutions, but it's much easier to understand and it doesn't use recursion.
#include <initializer_list>
#include <array>
#include <tuple>
#include <utility>  // std::index_sequence

template<int... members>
struct IntList
{
    constexpr bool operator==(IntList) const { return true; }

    template<int... others>
    constexpr bool operator==(IntList<others...>) const { return false; }

    template<int idx>
    static constexpr auto at()
    {
        return std::get<idx>(std::make_tuple(members...));
    }

    template<int x>
    static constexpr auto indexOf()
    {
        int sum {};
        auto _ = { 0, (x > members ? ++sum : 0)... };
        return sum;
    }

    template<int x>
    static constexpr auto count()
    {
        int sum {};
        auto _ = { 0, (x == members ? ++sum : 0)... };
        return sum;
    }

    template<int i>
    static constexpr auto ith()
    {
        int result{};
        auto _ = {
            ( i >= indexOf<members>() && i < indexOf<members>() + count<members>() ?
              result = members : 0 )...
        };
        return result;
    }

    template<std::size_t... i>
    static constexpr auto sortImpl(std::index_sequence<i...>)
    {
        return IntList< ith<i>()... >();
    }

    static constexpr auto sort()
    {
        return sortImpl(std::make_index_sequence<sizeof...(members)>());
    }
};
static_assert(IntList<1, 2, 3>().at<1>() == 2, "");
static_assert(IntList<>().indexOf<1>() == 0, "");
static_assert(IntList<1>().indexOf<1>() == 0, "");
static_assert(IntList<1, 2, 3, 4>().indexOf<3>() == 2, "");
static_assert(IntList<>().count<1>() == 0, "");
static_assert(IntList<1>().count<1>() == 1, "");
static_assert(IntList<1, 1>().count<1>() == 2, "");
static_assert(IntList<2, 2, 1>().count<1>() == 1, "");
static_assert(IntList<1, 2, 1>().count<1>() == 2, "");
static_assert(IntList<>().sort() == IntList<>(), "");
static_assert(IntList<1>().sort() == IntList<1>(), "");
static_assert(IntList<1, 2>().sort() == IntList<1, 2>(), "");
static_assert(IntList<2, 1>().sort() == IntList<1, 2>(), "");
static_assert(IntList<3, 2, 1>().sort() == IntList<1, 2, 3>(), "");
static_assert(IntList<2, 2, 1>().sort() == IntList<1, 2, 2>(), "");
static_assert(IntList<4, 7, 2, 5, 1>().sort() == IntList<1, 2, 4, 5, 7>(), "");
static_assert(IntList<4, 7, 7, 5, 1, 1>().sort() == IntList<1, 1, 4, 5, 7, 7>(), "");
The data is integers from 0 to (SIZEV-1) * (a+b+c), but the number of integers is SIZEV^3. It's a dense group of integers with a small range, so CountingSort is perfect (and you never need to build the unsorted array, just increment counts while generating).
Regardless of keeping around the counts / prefix sums, CountingSort is absolutely going to be a big win in startup time to sort the vector, vs. other sorts, keeping everything else the same.
You can keep a compact form (O(cuberoot(n)) size) of your data as a vector of prefix sums, for lookups from m_vector in O(log (cuberoot(n))) time (binary search the prefix sums), where n is the length of m_vector. See below.
Depending on cache / memory latency, not actually expanding m_vector might or might not be a performance win. If a range of values is needed, you can very quickly generate sequential elements of m_vector on the fly, from the prefix sums.
class A {
    std::vector<uint16_t> m_counts; // needs to be 32b for SIZEV>=794 (found experimentally).
    std::vector<uint32_t> m_pos;    // values are huge: indices into m_vector, up to SIZEV**3 - 1
    std::vector<uint16_t> m_vector; // can be 16b until SIZEV>3121: max val is only (a+b+c) * (SIZEV-1)
};
void A::fillVector()
{
    const int a=5;
    const int b=7;
    const int c=9;
    const auto max_val = (SIZEV-1) * (a+b+c);

    m_vector.reserve(SIZEV*SIZEV*SIZEV);
    m_vector.resize(0);
    // or clear it, but that writes tons of mem, unless you use a custom Allocator::construct to leave it uninit
    // http://en.cppreference.com/w/cpp/container/vector/resize

    m_pos.resize(max_val + 1);    // again, ideally avoid zeroing
                                  // but if not, do it before m_counts

    m_counts.clear();             // do this one last, so it's hot in cache even if others wasted time writing zeros.
    m_counts.resize(max_val + 1); // vector is now zeroed
    // Optimization: don't have a separate m_counts.
    // zero and count into m_pos, then do prefix summing in-place

    // manually strength-reduce the multiplication to addition
    // in case the compiler decides it won't, or can't prove it won't overflow the same way
    // Not necessary with gcc or clang: they both do this already
    for(int kc=c*(SIZEV-1) ; kc >= 0 ; kc-=c) {
        for(int jb=b*(SIZEV-1) ; jb >= 0 ; jb-=b) {
            for(int ia=a*(SIZEV-1) ; ia >= 0 ; ia-=a) {
                m_counts[kc + jb + ia]++;
                // do the smallest stride in the inner-most loop, for better cache locality
            }
        }
    }
    // write the early elements last, so they'll be hot in the cache when we're done

    int val = 0;
    uint32_t sum = 0;
    for ( auto &count : m_counts ) {
        m_pos[val] = sum;   // build our vector of prefix sums: start of this value's run
        m_vector.insert(m_vector.end(), count, val);
        // count is allowed to be zero for vector::insert(pos, count, value)
        sum += count;
        ++val;
        //count = (sum+=count); // in-place conversion to prefix sums
    }
    assert(m_vector.size() == SIZEV*SIZEV*SIZEV);
}
Or, instead of actually expanding a 1.6GB array, make prefix sums of the counts, giving you a vector of the start position of the run of that index as an element in m_vector. i.e. idx = m_pos[val]; m_vector[idx] == val. (This breaks down for val <= 13, where there are values that can't be represented as a sum of a, b, and c, so there are zeros in m_counts, and repeats in m_pos.)
Anyway, you can replace a read of m_vector[i] with a binary search for i in m_pos. You're looking for the highest index in m_pos that has a value <= i. That index is what you'd find at m_vector[i]. (Or something like that; I may have an off-by-one error.)
A hash table won't work, because you need to map multiple values of i to each number from 0..(750*(a+b+c)). (All the i's where m_vector[i] has the same value.)
If you need a run of sequential elements, generate them on the fly into a tmp buffer. Look at m_pos[i+1] to see when the next element with a different value is coming. (Looking at m_counts might save some subtraction, but you're probably better off just taking differences in m_pos to invert the prefix sums, to avoid cache misses / cache pollution from touching a 2nd array.)
Actually, m_counts probably doesn't need to be kept around as a class member at all, just a temporary in fillVector. Or fillVector can count into m_pos, and convert it in-place into prefix sums.
Ideally there's something clever you can do with templates to choose types that are wide enough, but no wider than needed, for m_counts and m_vector. I don't know number theory, so I can't prove that there won't be one bucket of m_counts that overflows a uint16_t. The average count will be 750**3 / (750*(5+7+9)) = 26786, and they're certainly clustered towards the high end of m_counts. In practice, SIZEV=793 can use uint16_t counters, while SIZEV=794 produces several counts > 65536 (thanks to Chris for the working example where I could easily test this).
m_vector can be uint16_t until (SIZEV-1)*(a+b+c) > MAX_UINT16 (65535), i.e. until SIZEV >= 3122, at which point m_vector takes 28.3 GiB of RAM.
At SIZEV = 750, m_pos is about 2x L1 cache size (Intel CPU) (750*(5+7+9) entries * 4B each = 63000B). If the compiler does a good job and makes a binary search with conditional-move instead of unpredictable branch instructions, this could be quite fast. It will certainly save you a lot of main-memory traffic, which is valuable if you have multiple threads.
Alternatively, never touching m_vector means you can handle problem sizes which would require more memory than you have to store the list.
If you want to get really creative with optimizing for cache when creating m_counts in the first place (with the triple-nested loop), have the innermost loop go forwards and then back, instead of the same direction both times. This will only matter for extremely large SIZEV, or if the other hyperthread is putting a lot of pressure on the cache.
for(int kc=c*(SIZEV-1) ; kc >= 0 ; kc-=c) {
    for(int jb=b*(SIZEV-1) ; jb >= 0 ; jb-=b) {
        for(int ia=0 ; ia<SIZEV*a ; ia+=a)
            counts[kc + jb + ia]++;
        if ((jb -= b) < 0) break;  // no row left for the backward pass
        for(int ia=a*(SIZEV-1) ; ia >= 0 ; ia-=a)
            counts[kc + jb + ia]++;
    }
}
Counting down towards zero (with or without the bidirectional inner loops) is very likely a small win for the beginning of the next loop, before it becomes memory-bound doing big memsets when counts get high. Also a win for scanning through forwards to do prefix sums in place.
Is there any hope of finding a closed-form formula for the i-th element in the sorted vector? Or even an O(log i) algorithm for generating it on the fly?
Unless you need a lot of sequential elements from this vector when you access it, it might be faster to compute it on the fly. Memory is slow, CPU is fast, so if you can compute a[i] in under ~150 clock cycles, you come out ahead. (Assuming every access is a cache miss, or that not touching all that vector memory reduces cache misses in the rest of your program.)
If we can do this, we could in theory write the sorted array in order in the first place.
To do that: shuffle the constants so a <= b <= c. Then the sorted sequence starts: 0, a, [a*2 .. a*int(b/a)], b, [b + a .. b + a*int((c-b)/a) mixed with b*2 .. b*int(c/b)], c, [some number of b*x + a*y], c+a, [more b*x + a*y], ...
Ok, so this is turning into a combinatorial mess, and this idea probably isn't viable. At least, not for the general case of any a, b, and c.
With a=5, b=7, c=9:
0, 5=a, 7=b, 9=c, 10=2a, 12=b+a, 14=2b, 14=c+a, 15=3a, 16=c+b, 17=2a+b, 18=2c
I'm too sleepy to see a pattern, but here's a longer list:
# bash
limit=5; for ((i=0 ; i<limit ; i++)); do
for ((j=0 ; j<limit ; j++)); do
for ((k=0 ; k<limit ; k++)); do
printf "%2d: %d %d %d\n" $((5*i + 7*j + 9*k)) $i $j $k;
done; done; done | sort -n | cat -n
1 0: 0 0 0
2 5: 1 0 0
3 7: 0 1 0
4 9: 0 0 1
5 10: 2 0 0
6 12: 1 1 0
7 14: 0 2 0
8 14: 1 0 1
9 15: 3 0 0
10 16: 0 1 1
11 17: 2 1 0
12 18: 0 0 2
13 19: 1 2 0
14 19: 2 0 1
15 20: 4 0 0
16 21: 0 3 0
17 21: 1 1 1
18 22: 3 1 0
19 23: 0 2 1
20 23: 1 0 2
21 24: 2 2 0
22 24: 3 0 1
23 25: 0 1 2
24 26: 1 3 0
25 26: 2 1 1
26 27: 0 0 3
27 27: 4 1 0
28 28: 0 4 0
29 28: 1 2 1
30 28: 2 0 2
31 29: 3 2 0
32 29: 4 0 1
33 30: 0 3 1
34 30: 1 1 2
35 31: 2 3 0
36 31: 3 1 1
37 32: 0 2 2
38 32: 1 0 3
39 33: 1 4 0
40 33: 2 2 1
41 33: 3 0 2
42 34: 0 1 3
43 34: 4 2 0
44 35: 1 3 1
45 35: 2 1 2
46 36: 0 0 4
47 36: 3 3 0
48 36: 4 1 1
49 37: 0 4 1
50 37: 1 2 2
51 37: 2 0 3
52 38: 2 4 0
53 38: 3 2 1
54 38: 4 0 2
55 39: 0 3 2
56 39: 1 1 3
57 40: 2 3 1
58 40: 3 1 2
59 41: 0 2 3
60 41: 1 0 4
61 41: 4 3 0
62 42: 1 4 1
63 42: 2 2 2
64 42: 3 0 3
65 43: 0 1 4
66 43: 3 4 0
67 43: 4 2 1
68 44: 1 3 2
69 44: 2 1 3
70 45: 3 3 1
71 45: 4 1 2
72 46: 0 4 2
73 46: 1 2 3
74 46: 2 0 4
75 47: 2 4 1
76 47: 3 2 2
77 47: 4 0 3
78 48: 0 3 3
79 48: 1 1 4
80 48: 4 4 0
81 49: 2 3 2
82 49: 3 1 3
83 50: 0 2 4
84 50: 4 3 1
85 51: 1 4 2
86 51: 2 2 3
87 51: 3 0 4
88 52: 3 4 1
89 52: 4 2 2
90 53: 1 3 3
91 53: 2 1 4
92 54: 3 3 2
93 54: 4 1 3
94 55: 0 4 3
95 55: 1 2 4
96 56: 2 4 2
97 56: 3 2 3
98 56: 4 0 4
99 57: 0 3 4
100 57: 4 4 1
101 58: 2 3 3
102 58: 3 1 4
103 59: 4 3 2
104 60: 1 4 3
105 60: 2 2 4
106 61: 3 4 2
107 61: 4 2 3
108 62: 1 3 4
109 63: 3 3 3
110 63: 4 1 4
111 64: 0 4 4
112 65: 2 4 3
113 65: 3 2 4
114 66: 4 4 2
115 67: 2 3 4
116 68: 4 3 3
117 69: 1 4 4
118 70: 3 4 3
119 70: 4 2 4
120 72: 3 3 4
121 74: 2 4 4
122 75: 4 4 3
123 77: 4 3 4
124 79: 3 4 4
125 84: 4 4 4
A std::vector<int> does not have any constexpr constructors (because dynamic memory allocation is disallowed for constexpr). So you can't sort a std::vector<int> at compile time.
You can create a std::array<int, N> at compile time for a constant N, but you'd have to write your own sorting routine because std::sort is not constexpr either.
You can also write a Boost.MPL compile-time vector or list and use the sort routine of that. But this won't scale as well as std::array.
Another angle of attack might be to store the vector in a static variable and do the sorting at program initialization. Your program just takes a bit longer to start, but it won't affect the rest of its main functionality.
Since sorting is O(N log N), you might even go for a two-step build: write the sorted vector to a file, and either compile/link it into your main program, or load it in O(N) at program startup into a static variable.
The classical approach for lengthy calculations that can be precomputed is to calculate the result as part of the build process, generating a .cpp that hardcodes the result (on platforms that have embedded resources, these can be used as well).
However, here the calculation is extremely simple; the slow part is probably just the allocation, which, if you want to keep the data in a std::vector, has to happen at runtime. If you can live with a C-style array, you could put it all in the executable as described above, but that would produce an executable 4 MB larger, and the slowdown caused by loading it from disk would offset any speed benefit of the precalculation.
IOW: precalculating at build time makes sense when the computation is expensive and the output is small. Your case is at the opposite end of the spectrum, so I would avoid it.