I am trying to find a more efficient solution to a combinatorics problem than the solution I have already found.
Suppose I have a set of N objects (indexed 0..N-1) and wish to consider each subset of size K (0<=K<=N). There are S=C(N,K) (i.e., "N choose K") such subsets. I wish to map (or "encode") each such subset to a unique integer in the range 0..S-1.
Using N=7 (i.e., indexes are 0..6) and K=4 (S=35) as an example, the following mapping is the goal:
0 1 2 3 --> 0
0 1 2 4 --> 1
...
2 4 5 6 --> 33
3 4 5 6 --> 34
N and K were chosen small for the purposes of illustration. However, in my actual application, C(N,K) is far too large to obtain these mappings from a lookup table. They must be computed on-the-fly.
In the code that follows, combinations_table is a pre-computed two-dimensional array for fast lookup of C(N,K) values. All code given is compliant with the C++14 standard.
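(The precomputation of combinations_table is not shown in my code. For completeness, the sketch below is one way such a table could be built, using Pascal's rule C(n,k) = C(n-1,k-1) + C(n-1,k) so that table[n][k] == C(n,k); it is only an illustration, not my actual member definition.)

#include <cstdint>
#include <vector>

// Illustrative only: build an (n_max+1) x (n_max+1) table with table[n][k] == C(n,k).
// Entries with k > n are left at 0, matching the convention C(n,k) = 0 in that case.
std::vector<std::vector<std::uint64_t>> make_combinations_table(std::size_t n_max)
{
    std::vector<std::vector<std::uint64_t>> table(
        n_max + 1, std::vector<std::uint64_t>(n_max + 1, 0));

    for (std::size_t n = 0; n <= n_max; ++n)
    {
        table[n][0] = 1;                                          // C(n,0) = 1
        for (std::size_t k = 1; k <= n; ++k)
            table[n][k] = table[n - 1][k - 1] + table[n - 1][k];  // Pascal's rule
    }

    return table;
}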
If the objects in a subset are ordered by increasing order of their indexes, the following code will compute that subset's encoding:
template<typename T, typename T::value_type N1, typename T::value_type K1>
typename T::value_type combination_encoder_t<T, N1, K1>::encode(const T &indexes)
{
    auto offset{combinations_table[N1][K1] - combinations_table[N1 - indexes[0]][K1]};

    for (typename T::value_type index{1}; index < K1; ++index)
    {
        auto offset_due_to_current_index{
            combinations_table[N1 - (indexes[index - 1] + 1)][K1 - index] -
            combinations_table[N1 - indexes[index]][K1 - index]
        };
        offset += offset_due_to_current_index;
    }

    return offset;
}
Here, template parameter T will be either a std::array<> or a std::vector<> holding the collection of indexes we wish to find the encoding for.
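As a concrete check of the formula, encoding the subset 2 4 5 6 from the N=7, K=4 example works out by hand to:

offset = [C(7,4) - C(5,4)] + [C(4,3) - C(3,3)] + [C(2,2) - C(2,2)] + [C(1,1) - C(1,1)]
       = (35 - 5) + (4 - 1) + 0 + 0
       = 33

which matches the "2 4 5 6 --> 33" entry in the mapping listed earlier.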
This is essentially an "order-preserving minimal perfect hash function", as can be read about here:
https://en.wikipedia.org/wiki/Perfect_hash_function
In my application, the objects in a subset are already naturally ordered at the time of encoding, so I do not incur the added running time of a sort operation. Therefore, my total running time for encoding is that of the algorithm presented above, which has O(K) running time (i.e., linear in K and not dependent on N).
The code above works fine. The interesting part is trying to invert this function (i.e., to "decode" an encoded value back into the object indexes that produced it).
For decoding, I could not come up with a solution with linear running time.
Instead of direct calculation of the indexes that correspond to an encoded value (which would be O(K)), I ended up implementing a binary search of the index space to find them. This results in a running time that is (no worse than, but which we'll call) O(K*lg N). The code to do this is as follows:
template<typename T, typename T::value_type N1, typename T::value_type K1>
void combination_encoder_t<T, N1, K1>::decode(const typename T::value_type encoded_value, T &indexes)
{
    typename T::value_type offset{0};
    typename T::value_type previous_index_selection{0};

    for (typename T::value_type index{0}; index < K1; ++index)
    {
        auto lowest_possible{index > 0 ? previous_index_selection + 1 : 0};
        auto highest_possible{N1 - K1 + index};

        // Find the *highest* ith index value whose offset increase gives a
        // total offset less than or equal to the value we're decoding.
        while (true)
        {
            auto candidate{(highest_possible + lowest_possible) / 2};

            auto offset_increase_due_to_candidate{
                index > 0 ?
                    combinations_table[N1 - (indexes[index - 1] + 1)][K1 - index] -
                    combinations_table[N1 - candidate][K1 - index]
                :
                    combinations_table[N1][K1] -
                    combinations_table[N1 - candidate][K1]
            };

            if ((offset + offset_increase_due_to_candidate) > encoded_value)
            {
                // candidate is *not* the solution
                highest_possible = candidate - 1;
                continue;
            }

            // candidate *could* be the solution. Check if it is by checking whether candidate + 1
            // could be the solution. That would rule out candidate being the solution.
            auto next_candidate{candidate + 1};

            auto offset_increase_due_to_next_candidate{
                index > 0 ?
                    combinations_table[N1 - (indexes[index - 1] + 1)][K1 - index] -
                    combinations_table[N1 - next_candidate][K1 - index]
                :
                    combinations_table[N1][K1] -
                    combinations_table[N1 - next_candidate][K1]
            };

            if ((offset + offset_increase_due_to_next_candidate) <= encoded_value)
            {
                // candidate is *not* the solution
                lowest_possible = next_candidate;
                continue;
            }

            // candidate *is* the solution
            offset += offset_increase_due_to_candidate;
            indexes[index] = candidate;
            previous_index_selection = candidate;
            break;
        }
    }
}
Can this be improved on?
Take a look at the recursive formula for combinations:

C(n, k) = C(n-1, k-1) + C(n-1, k)

Suppose you have a combination space C(n,k). You can divide that space into two subspaces:

C(n-1, k-1): all combinations where the first element of the original set (of length n) is present
C(n-1, k): all combinations where the first element is not present

If you have an index X that corresponds to a combination from C(n,k), you can identify whether the first element of your original set belongs to the subset corresponding to X by checking which subspace X falls into:

X < C(n-1, k-1): the first element belongs to the subset
X >= C(n-1, k-1): it does not

Then you can recursively apply the same approach to C(n-1, ...) and so on, until you've found the answer for all n elements of the original set.
Python code to illustrate this approach:
import itertools, math
n=7
k=4
stuff = list(range(n))
# function that maps x into the corresponding combination
def rec(x, n, k, index):
    if n == 0 and k == 0:
        return index
    # C(n,k) = C(n-1,k-1) + C(n-1, k)
    # C(n,0) = C(n,n) = 1
    c = math.comb(n-1, k-1) if k > 0 else 0
    if x < c:
        index.add(stuff[len(stuff)-n])
        return rec(x, n-1, k-1, index)
    else:
        return rec(x - c, n-1, k, index)
# Test:
for i, eta in enumerate(itertools.combinations(stuff, k)):
    comb = rec(i, n, k, set())
    print(f'{i} {eta} {comb}')
Produced output:
0 (0, 1, 2, 3) {0, 1, 2, 3}
1 (0, 1, 2, 4) {0, 1, 2, 4}
2 (0, 1, 2, 5) {0, 1, 2, 5}
3 (0, 1, 2, 6) {0, 1, 2, 6}
4 (0, 1, 3, 4) {0, 1, 3, 4}
5 (0, 1, 3, 5) {0, 1, 3, 5}
...
33 (2, 4, 5, 6) {2, 4, 5, 6}
34 (3, 4, 5, 6) {3, 4, 5, 6}
This approach is O(n) (while your approach seems to be O(k * log(n))), and it should have a fairly small constant factor if rewritten iteratively. I'm not sure whether it will actually yield an improvement; that needs to be tested.
I also wonder how large your typical k and n values are? I assume they are small enough that C(n,k) still fits into 64 bits.
Of course, you can use precomputed tables instead of math.comb, replace the recursion with iteration (it's tail recursion, so you don't need the stack), and use an array instead of the set for the result; a sketch of such a rewrite is given below.
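To make that concrete, here is one way the iterative rewrite could look in C++14 (a standalone sketch, not drop-in code for your combination_encoder_t; combinations is assumed to be a precomputed table with combinations[n][k] == C(n,k), like your combinations_table):

#include <cstdint>
#include <vector>

// Decode x in [0, C(n,k)) into the k object indexes of the corresponding subset.
// Walk candidate elements 0, 1, ..., n-1 and compare x against C(n-1, k-1),
// the number of remaining subsets that contain the current element.
std::vector<std::size_t> decode_iterative(std::uint64_t x, std::size_t n, std::size_t k,
                                          const std::vector<std::vector<std::uint64_t>> &combinations)
{
    std::vector<std::size_t> indexes;
    indexes.reserve(k);

    std::size_t element{0};
    while (k > 0)
    {
        const std::uint64_t with_current = combinations[n - 1][k - 1];  // subsets that contain 'element'
        if (x < with_current)
        {
            indexes.push_back(element);  // 'element' is in the subset
            --k;
        }
        else
        {
            x -= with_current;           // skip all subsets that contain 'element'
        }
        --n;
        ++element;
    }

    return indexes;
}

With the N=7, K=4 example, decode_iterative(33, 7, 4, table) returns {2, 4, 5, 6}, matching "2 4 5 6 --> 33" from the question's mapping. Each iteration either selects or skips one element, so the loop runs at most n times.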