Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Create a sequence which is ordered by bits set

Tags:

c++

math

I'm looking for a reversible function unsigned f(unsigned) for which the number of bits set in f(i) increases with i, or at least does not decrease. Obviously, f(0) has to be 0 then, and f(~0) must come last. In between there's more flexibility. After f(0), the next 32* values must be 1U<<0 to 1U<<31, but I don't care a lot about the order (they all have 1 bit set).

I'd like an algorithm which doesn't need to calculate f(0)...f(i-1) in order to calculate f(i), and a complete table is also unworkable.

This is similar to Gray codes, but I can't see a way to reuse that algorithm. I'm trying to use this to label a large data set, and prioritize the order in which I search them. The idea is that I have a key C, and I'll check labels C ^ f(i). Low values of i should give me labels similar to C, i.e. differing in only a few bits.

[*] Bonus points for not assuming that unsigned has 32 bits.

[example] A valid initial sequence:

0, 1, 2, 4, 16, 8 ... // 16 and 8 both have one bit set, so they compare equal

An invalid initial sequence:

0, 1, 2, 3, 4 ... // 3 has two bits set, so it cannot precede 4 or 2147483648.
like image 662
MSalters Avatar asked Oct 10 '13 13:10

MSalters


Video Answer


2 Answers

Ok, seems like I have a reasonable answer. First let's define binom(n,k) as the number of ways in which we can set k out of n bits. That's the classic Pascal triangle:

1  1
1  2  1
1  3  3  1
1  4  6  4  1
1  5 10 10  5  1
1  6 15 20 15  6  1
1  7 21 35 35 21  7  1
1  8 28 56 70 56 28  8  1
...

Easily calculated and cached. Note that the sum of each line is 1<<lineNumber.

The next thing we'll need is the partial_sum of that triangle:

1  2
1  3  4
1  4  7  8
1  5 11 15 16
1  6 16 26 31  32
1  7 22 42 57  63  64
1  8 29 64 99  120 127 128
1  9 37 93 163 219 247 255 256 
...

Again, this table can be created by summing two values from the previous line, except that the new entry on each line is now 1<<line instead of 1.

Let's use these tables above to construct f(x) for an 8 bits number (it trivially generalizes to any number of bits). f(0) still has to be 0. Looking up the 8th row in the first triangle, we see that next 8 entries are f(1) to f(9), all with one bit set. The next 28 entries (7+6+5+4+3+2+1) all have 2 bits set, so that's f(10) to f(37). The next 56 entries, f(38) to f(93) have 3 bits, and there are 70 entries with 4 bits set. From symmetry we can see that they're centered around f(128), in particular they're f(94) to f(163). And obviously, the only number with 8 bits set sorts last, as f(255).

So, with these tables we can quickly determine how many bits must be set in f(i). Just do a binary search in the last row of your table. But that doesn't answer exactly which bits are set. For that we need the previous rows.

The reason that each value in the table can be created from the previous line is simple. binom(n,k) == binom(k, n-1) + binom(k-1, n-1). There are two sorts of numbers with k bits set: Those that start with a 0... and numbers which start with 1.... In the first case, the next n-1 bits must contain those k bits, in the second case the next n-1 bits must contain only k-1 bits. Special cases are of course 0 out of n and n out of n.

This same stucture can be used to quickly tell us what f(16) must be. We already had established that it must contain 2 bits set, as it falls in the range f(10) - f(37). In particular, it's number 6 with 2 bits set (starting as usual with 0). It's useful to define this as an offset in a range as we'll try to shrink the length this range from 28 down to 1.

We now subdivide that range into 21 values which start with a zero and 7 which start a one. Since 6 < 21, we know that the first digit is a zero. Of the remaining 7 bits, still 2 need to be set, so we move up a line in the triangle and see that 15 values start with two zeroes, and 6 start with 01. Since 6 < 15, f(16) starts with 00. Going further up, 7 <= 10 so it starts with 000. But 6 == 6, so it doesn't start with 0000 but 0001. At this point we change the start of the range, so the new offset becomes 0 (6-6)

We know need can focus only on the numbers that start with 0001 and have one extra bit, which are f(16)...f(19). It should be obvious by know that the range is f(16)=00010001, f(17)=00010010, f(18)=00010100, f(19)=00011000.

So, to calculate each bit, we move one row up in the triangle, compare our "remainder", add a zero or one based on the comparison possibly go left one column. That means the computational complexity of f(x) is O(bits), or O(log N), and the storage needed is O(bits*bits).

like image 101
MSalters Avatar answered Sep 22 '22 15:09

MSalters


For each given number k we know that there are binom(n, k) n-bit integers that have exactly k bits of value one. We can now generate a lookup table of n + 1 integers that store for each k how many numbers have less one bits. This lookup table can then be used to find the number o of one bits of f(i).

Once we know this number we subtract the lookup table value for this number of bits from i which leaves us with the permutation index p for numbers with the given number of one bits. Altough I have not done research in this area I am quite sure that there exists a method for finding the pth permutation of a std::vector<bool> which is initialized with zeros and o ones in the lowest bits.

The reverse function

Again the lookup table comes in handy. We can directly calculate the number of preceding numbers with less one bits by counting the one bits in the input integer and reading in the lookup table. Then you "only" need to determine the permutation index and add it to the looked up value and you are done.

Disclaimer

Of course this is only a rough outline and some parts (especially involving the permutations) might take longer than it sounds.

Addition

You stated yourself

I'm trying to use this to label a large data set, and prioritize the order in which I search them.

Which sounds to me as if you would be going from the low hamming distance to the high hamming distance. In this case it would be enough to have an incremental version which generates the next number from the previous:

unsigned next(unsigned previous)
{
    if(std::next_permutation(previous))
        return previous;
    else
        return (1 << (1 + countOneBits(previous))) - 1;
}

Of course std::next_permutation permutation does not work this way but I think it is clear how I mean to use it.

like image 24
Nobody moving away from SE Avatar answered Sep 21 '22 15:09

Nobody moving away from SE