
Deterministic bit scrambling to filter coordinates

I am trying to write a function that, given an (x,y) coordinate pair and the random seed of the program, will pseudo-randomly return true for some preset percentage of all such pairs. There are no limits on x or y beyond the restrictions of the data type, which is a 32-bit signed int.

My current approach is to scramble the bits of x, y, and the seed together and then compare the resulting number to the percentage:

float percentage = 0.005;
...
unsigned int n = (x ^ y) ^ seed;
return (((float) n / UINT_MAX) < percentage);

However, it seems that this approach would be biased for certain values of x and y. For example, if it returns true for (0,a), it will also return true for (a,0).
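The symmetry described above can be checked directly. The following is a minimal sketch that wraps the snippet from the question in a function; the names `xor_filter` and `is_symmetric` are made up for illustration.

```cpp
#include <climits>
#include <cstdint>

// The XOR-based filter from the question, wrapped in a function.
bool xor_filter(int32_t x, int32_t y, uint32_t seed, float percentage) {
  uint32_t n = (static_cast<uint32_t>(x) ^ static_cast<uint32_t>(y)) ^ seed;
  return (static_cast<float>(n) / UINT_MAX) < percentage;
}

// XOR is commutative, so swapping the coordinates never changes the result.
bool is_symmetric(int32_t a, int32_t b, uint32_t seed, float percentage) {
  return xor_filter(a, b, seed, percentage) ==
         xor_filter(b, a, seed, percentage);
}
```

The same reasoning shows a second pattern: whenever x equals y, the XOR cancels to zero and only the seed remains, so every point on the diagonal gets the same answer.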

I know this implementation that just XORs them together is naive. Is there a better bit-scrambling algorithm to use here that will not be biased?

Edit: To clarify, I am not starting with a set of (x,y) coordinates, nor am I trying to get a fixed-size set of coordinates that evaluate to true. The function should be able to evaluate a truth value for arbitrary x, y, and seed, with the percentage controlling the average frequency of "true" coordinates.

aosdict asked Jan 14 '15 01:01



2 Answers

The easy solution is to use a good hashing algorithm. You can do the range check on the value of hash(seed || x || y).
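As a rough sketch of the range check on hash(seed || x || y): the mixer below is the SplitMix64 finalizer, used here only as a stand-in assumption for "a good hashing algorithm", and the names `mix64`, `hash_coords`, and `is_selected` are hypothetical. Any well-distributed 64-bit hash could be substituted.

```cpp
#include <cstdint>

// Stand-in 64-bit mixer (SplitMix64 finalizer). This is an assumption for
// illustration, not a hash recommended by the answer.
uint64_t mix64(uint64_t z) {
  z += 0x9e3779b97f4a7c15ULL;
  z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
  z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
  return z ^ (z >> 31);
}

// Hypothetical helper: hash the seed, then the packed coordinates.
uint64_t hash_coords(uint64_t seed, int32_t x, int32_t y) {
  // Pack x into the high half and y into the low half, so that (x, y) and
  // (y, x) produce different inputs.
  uint64_t xy = (static_cast<uint64_t>(static_cast<uint32_t>(x)) << 32) |
                static_cast<uint32_t>(y);
  return mix64(mix64(seed) ^ xy);
}

// True for roughly `fraction` of all pairs, with 0 <= fraction <= 1.
bool is_selected(uint64_t seed, int32_t x, int32_t y, double fraction) {
  // Map the top 53 bits of the hash to a double in [0, 1).
  double u = static_cast<double>(hash_coords(seed, x, y) >> 11) *
             (1.0 / 9007199254740992.0);  // divide by 2^53
  return u < fraction;
}
```

Because the result depends only on the arguments, the function can be evaluated for arbitrary x, y, and seed in any order, which is what the question asks for.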

Of course, selecting points individually with percentage p does not guarantee that you will end up with a sample whose size will be exactly p * N. (That's the expected size of the sample, but any given sample will deviate a bit.) If you want to get a sample of size precisely k from a universe of N objects, you can use the following simple algorithm:

  • Examine the elements of the universe one at a time until k reaches 0.

  • When examining element i, add it to the sample if its hash value mapped onto the range [0, N-i) is less than k. If you add the element to the sample, decrement k.

There's no way to get the arithmetic absolutely perfect (since there is no way to perfectly partition 2^i different hash values into n buckets unless n is a power of 2), so there will always be a tiny bias. (Floating point arithmetic does not help; the number of possible floating point values is also fixed, and suffers from the same bias.)
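To make the partitioning argument concrete, here is a small sketch that counts how many hash values fall into each bucket when a deliberately tiny hash width is used. The function name `bucket_counts` is made up for this example, and the multiply-and-shift mapping is assumed as the way of reducing a hash to a bucket index.

```cpp
#include <cstdint>
#include <vector>

// Count how many of the 2^bits possible hash values land in each of n
// buckets under the multiply-and-shift mapping (h * n) >> bits.
// Intended for small widths (say bits <= 16) so the loop stays short.
std::vector<uint32_t> bucket_counts(uint32_t bits, uint32_t n) {
  std::vector<uint32_t> counts(n, 0);
  const uint64_t total = 1ULL << bits;
  for (uint64_t h = 0; h < total; ++h) {
    ++counts[(h * n) >> bits];
  }
  return counts;
}
```

With an 8-bit hash and 3 buckets the counts are 86, 85, 85: one bucket has to absorb the leftover value because 256 is not divisible by 3. With 4 buckets every count is exactly 64. The same effect exists at 32 or 64 bits, only far smaller relative to the bucket size.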

If you do 64-bit arithmetic, the bias will be truly tiny, but the arithmetic is more complicated unless your environment provides a 128-bit multiply. So you might feel satisfied with 32-bit computations, where the bias of one in a couple of thousand million [Note 1] doesn't matter. Here, you can use the fact that any 32 bits in your hash should be as unbiased as any other 32 bits, assuming your hash algorithm is any good (see below). So the following check should work fine:

// I need k elements from a remaining universe of n, and I have a 64-bit hash.
// Return true if I should select this element
bool select(uint32_t n, uint32_t k, uint64_t hash) {
  return ((hash & (uint32_t)(-1)) * (uint64_t)n) >> 32 < k;
}

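// Untested example sampler
// select exactly k elements from U, using a seed value
// (hash_function is a placeholder for whichever hash you choose)
template <typename E>
std::vector<E> sample(const std::vector<E>& U, uint64_t seed, uint32_t k) {
  std::vector<E> retval;
  // n counts the elements not yet examined, including the current one,
  // so each element is selected with probability k/n.
  for (uint32_t n = U.size(); k && n; --n) {
    const E& elt = U[n - 1];
    if (select(n, k, hash_function(seed, elt))) {
      retval.push_back(elt);
      --k;
    }
  }
  return retval;
}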

Assuming you need to do this a lot, you'll want to use a fast hash algorithm; since you're not actually working in a secure environment, you don't need to worry about whether the algorithm is cryptographically secure.

Many high-speed hashing algorithms work on 64-bit units, so you could maximize the speed by constructing a 128-bit input consisting of a 64-bit seed and the two 32-bit co-ordinates. You can then unroll the hash loop to do exactly two blocks.

I won't venture a guess at the best hash function for your purpose. You might want to check out one or more of these open-source hashing functions:

  • Farmhash https://code.google.com/p/farmhash/
  • Murmurhash https://code.google.com/p/smhasher/
  • xxhash https://code.google.com/p/xxhash/
  • siphash https://github.com/majek/csiphash/

... and many more.


Notes

  1. A couple of billion, if you're on that side of the Atlantic.
rici answered Nov 11 '22 00:11



I would prefer feeding seed, x, and y through a Combined Linear Congruential Generator.

This is generally much faster than hashing, and it is designed specifically for this purpose: to output a pseudo-random number uniformly in a certain range.

Using coefficients recommended by Wichmann-Hill (which are also used in some versions of Microsoft Excel) we can do:

si = 171 * s % 30269;
xi = 172 * x % 30307;
yi = 170 * y % 30323;

r_combined = fmod(si/30269. + xi/30307. + yi/30323., 1.);

return r_combined < percentage;

Where s is the seed on the first call, and the previous si on each subsequent call. (Thanks to rici's comment for this point.)
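A self-contained version of the snippet above might look like the following sketch. Two details are assumptions added here rather than part of the answer: the products are computed in 64-bit arithmetic so that 32-bit inputs cannot overflow, and negative inputs are reduced to non-negative residues so the result stays in [0, 1). The function names are hypothetical, and this version is a single stateless step that does not feed si back in as the next s.

```cpp
#include <cmath>
#include <cstdint>

// Reduce a possibly negative value to a residue in [0, m).
int64_t residue(int64_t v, int64_t m) {
  int64_t r = v % m;
  return r < 0 ? r + m : r;
}

// One combination step with the Wichmann-Hill coefficients over (s, x, y).
// Returns a value in [0, 1).
double combined_lcg(int32_t s, int32_t x, int32_t y) {
  const int64_t si = residue(171LL * s, 30269);
  const int64_t xi = residue(172LL * x, 30307);
  const int64_t yi = residue(170LL * y, 30323);
  return std::fmod(si / 30269.0 + xi / 30307.0 + yi / 30323.0, 1.0);
}

// Compare the combined value against the target fraction.
bool lcg_filter(int32_t s, int32_t x, int32_t y, double percentage) {
  return combined_lcg(s, x, y) < percentage;
}
```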

Imran answered Nov 11 '22 01:11
