Suppose we have an algorithm that receives an arbitrarily long stream of keys. As each key is processed, the algorithm generates a value between 0 and 1 for it, to be retrieved later. The input set is large enough that we cannot afford to store one value per key. The value-generating rule is independent across keys.
Now, assume that we can tolerate some error in the later lookup, but we still want to minimize the difference between the retrieved and the original values (i.e., asymptotically, over many random retrievals).
For example, if the original value for a given key was 0.008, retrieving 0.06 is much better than retrieving 0.6.
What data structures or algorithms can we use to address this problem?
A Bloom filter is the closest data structure I can think of. One could quantize the output range, use a Bloom filter for each bucket, and somehow combine their outputs at retrieval time to estimate the most likely value (a rough sketch of this idea is below). Before I proceed down this path and reinvent the wheel, are there any known data structures, algorithms, or theoretical or practical approaches that address this problem?
I am ideally looking for a solution that can parameterize the tradeoff between space and error rates.
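To make the bucketing idea concrete, here is a rough Python sketch of what I have in mind. It uses plain sets as stand-ins for real Bloom filters (so it is not space-efficient as written), and the bucket count and the rule for combining candidate buckets are arbitrary placeholders, not a worked-out design:

NUM_BUCKETS = 16  # quantize [0, 1] into this many buckets (arbitrary choice)

# One "filter" per bucket; a real implementation would use Bloom filters here,
# which is where the space savings and the error rate would come from.
buckets = [set() for _ in range(NUM_BUCKETS)]

def store(key, value):
    """Remember that `key` maps to a value inside one quantization bucket."""
    b = min(int(value * NUM_BUCKETS), NUM_BUCKETS - 1)
    buckets[b].add(key)

def retrieve(key):
    """Return the midpoint of the first bucket that claims to contain `key`.

    With real Bloom filters several buckets may claim the key (false
    positives), so some rule for combining the candidates is needed;
    taking the first match is just a placeholder.
    """
    for b, filt in enumerate(buckets):
        if key in filt:
            return (b + 0.5) / NUM_BUCKETS
    return None  # key was never stored (sets give no false negatives)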
For background: a Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. It belongs to the family of probabilistic data structures and algorithms (PDSA), which trade exactness for fixed or sublinear memory and constant execution time, and which are typically based on hashing. The elements themselves are never stored; instead, hashes of the elements are recorded in a bit array, and a membership query takes O(k) time, where k is the number of hash functions. This makes Bloom filters useful when the set to be searched is large.
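As a minimal illustration of that idea (a sketch only; the class name, the array size m, the number of hash functions k, and the double-hashing scheme are all arbitrary choices, not a tuned implementation):

import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over a bit array of size m."""

    def __init__(self, m=1024, k=4):
        self.m = m
        self.k = k
        self.bits = [False] * m

    def _indexes(self, item):
        # Derive k indexes from two halves of one digest (double hashing).
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for i in self._indexes(item):
            self.bits[i] = True

    def __contains__(self, item):
        # May return a false positive, but never a false negative.
        return all(self.bits[i] for i in self._indexes(item))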
Perhaps a variant of the Bloom filter called a compact approximator would work: it is like a Bloom filter, but generalized so that the entries are values from a lattice. Here the lattice is simply the floats between 0 and 1 (they have more structure than just a lattice, but they satisfy the requirements), or however you are storing those numbers.
An update replaces each relevant entry with the maximum of its current value and the value being remembered; a query computes the minimum over all of the key's relevant entries (examples below). The results can only overestimate the true value. By reversing the ordering (swapping min and max, and initializing the entries to 1 instead of 0) you get an underestimate instead; together the two give an interval that contains the true value.
So for example, using the first approximator (overestimation), putting in a number looks like this:
index1 = hash1(key)
data[index1] = max(data[index1], value)
index2 = hash2(key)
data[index2] = max(data[index2], value)
... and so on for each hash function
And getting the overestimation looks like:
result = 1
index1 = hash1(key)
result = min(data[index1], result)
index2 = hash2(key)
result = min(data[index2], result)
... and so on for each hash function
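Putting both directions together, a runnable sketch might look something like the following. The class name, the table size m, the number of hash functions k, and the hashing scheme are arbitrary choices of mine; the point is only that, as described above, the true value of any inserted key lies inside the returned interval:

import hashlib

class CompactApproximator:
    """Bloom-filter-like structure whose cells hold floats in [0, 1].

    `over` cells start at 0 and are raised with max; querying them with min
    gives an overestimate. `under` cells start at 1 and are lowered with min;
    querying them with max gives an underestimate.
    """

    def __init__(self, m=4096, k=4):
        self.m = m
        self.k = k
        self.over = [0.0] * m
        self.under = [1.0] * m

    def _indexes(self, key):
        # Derive k indexes from two halves of one digest (double hashing).
        digest = hashlib.sha256(str(key).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def update(self, key, value):
        for i in self._indexes(key):
            self.over[i] = max(self.over[i], value)
            self.under[i] = min(self.under[i], value)

    def query(self, key):
        """Return an (underestimate, overestimate) interval for the key's value."""
        idx = self._indexes(key)
        hi = min(self.over[i] for i in idx)
        lo = max(self.under[i] for i in idx)
        return lo, hi

# Usage: increasing m relative to the number of keys tends to narrow the
# intervals, which is one way to trade space against error.
ca = CompactApproximator(m=4096, k=4)
ca.update("key-a", 0.008)
ca.update("key-b", 0.6)
lo, hi = ca.query("key-a")   # interval containing 0.008
print(lo, hi)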