Suppose we have an algorithm that receives an arbitrarily long stream of keys. As each key is processed, the algorithm generates a value between 0 and 1 for it, to be retrieved later. The input set is large enough that we cannot afford to store one value per key. The value-generating rule is independent across keys.
Now, assume that we can tolerate some error in the later lookup, but we still want to minimize the difference between the retrieved and the original values (i.e., asymptotically, over many random retrievals).
For example, if the original value for a given key was 0.008, retrieving 0.06 is much better than retrieving 0.6.
What data structures or algorithms can we use to address this problem?
A Bloom filter is the closest data structure I can think of. One could quantize the output range, use a Bloom filter for each bucket, and somehow combine their outputs at retrieval time to estimate the most likely value (a rough sketch of this idea is below). Before I proceed down this path and reinvent the wheel, are there any known data structures, algorithms, or theoretical or practical approaches that address this problem?
I am ideally looking for a solution that can parameterize the tradeoff between space and error rates.
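To make the bucketing idea concrete, here is a rough Python sketch of what I have in mind. It uses plain sets as stand-ins for real Bloom filters (so it is not space-efficient as written), and the bucket count and the rule for combining candidate buckets are arbitrary placeholders, not a worked-out design:

NUM_BUCKETS = 16  # quantize [0, 1] into this many buckets (arbitrary choice)

# One "filter" per bucket; a real implementation would use Bloom filters here,
# which is where the space savings and the error rate would come from.
buckets = [set() for _ in range(NUM_BUCKETS)]

def store(key, value):
    """Remember that `key` maps to a value inside one quantization bucket."""
    b = min(int(value * NUM_BUCKETS), NUM_BUCKETS - 1)
    buckets[b].add(key)

def retrieve(key):
    """Return the midpoint of the first bucket that claims to contain `key`.

    With real Bloom filters several buckets may claim the key (false
    positives), so some rule for combining the candidates is needed;
    taking the first match is just a placeholder.
    """
    for b, filt in enumerate(buckets):
        if key in filt:
            return (b + 0.5) / NUM_BUCKETS
    return None  # key was never stored (sets give no false negatives)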
For background: a Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. It belongs to the family of probabilistic data structures and algorithms (PDSA), which trade exactness for fixed or sublinear memory and constant execution time, and which are typically based on hashing. The elements themselves are never stored; instead, hashes of the elements are recorded in a bit array, and a membership query takes O(k) time, where k is the number of hash functions. This makes Bloom filters useful when the set to be searched is large.
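As a minimal illustration of that idea (a sketch only; the class name, the array size m, the number of hash functions k, and the double-hashing scheme are all arbitrary choices, not a tuned implementation):

import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over a bit array of size m."""

    def __init__(self, m=1024, k=4):
        self.m = m
        self.k = k
        self.bits = [False] * m

    def _indexes(self, item):
        # Derive k indexes from two halves of one digest (double hashing).
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for i in self._indexes(item):
            self.bits[i] = True

    def __contains__(self, item):
        # May return a false positive, but never a false negative.
        return all(self.bits[i] for i in self._indexes(item))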
Perhaps a variant of the Bloom filter called a compact approximator would work: it is like a Bloom filter, but generalized so that the entries are values from a lattice. Here the lattice is simply the floats between 0 and 1 (they have more structure than just a lattice, but they satisfy the requirements), or however you are storing those numbers.
An update replaces each relevant entry with the maximum of its current value and the value being remembered; a query computes the minimum over all of the key's relevant entries (examples below). The results can only overestimate the true value. By reversing the ordering (swapping min and max, and initializing the entries to 1 instead of 0) you get an underestimate instead; together the two give an interval that contains the true value.
So for example, using the first approximator (overestimation), putting in a number looks like this:
index1 = hash1(key)
data[index1] = max(data[index1], value)
index2 = hash2(key)
data[index2] = max(data[index2], value)
... and so on for each hash function
And getting the overestimation looks like:
result = 1
index1 = hash1(key)
result = min(data[index1], result)
index2 = hash2(key)
result = min(data[index2], result)
... and so on for each hash function
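Putting both directions together, a runnable sketch might look something like the following. The class name, the table size m, the number of hash functions k, and the hashing scheme are arbitrary choices of mine; the point is only that, as described above, the true value of any inserted key lies inside the returned interval:

import hashlib

class CompactApproximator:
    """Bloom-filter-like structure whose cells hold floats in [0, 1].

    `over` cells start at 0 and are raised with max; querying them with min
    gives an overestimate. `under` cells start at 1 and are lowered with min;
    querying them with max gives an underestimate.
    """

    def __init__(self, m=4096, k=4):
        self.m = m
        self.k = k
        self.over = [0.0] * m
        self.under = [1.0] * m

    def _indexes(self, key):
        # Derive k indexes from two halves of one digest (double hashing).
        digest = hashlib.sha256(str(key).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def update(self, key, value):
        for i in self._indexes(key):
            self.over[i] = max(self.over[i], value)
            self.under[i] = min(self.under[i], value)

    def query(self, key):
        """Return an (underestimate, overestimate) interval for the key's value."""
        idx = self._indexes(key)
        hi = min(self.over[i] for i in idx)
        lo = max(self.under[i] for i in idx)
        return lo, hi

# Usage: increasing m relative to the number of keys tends to narrow the
# intervals, which is one way to trade space against error.
ca = CompactApproximator(m=4096, k=4)
ca.update("key-a", 0.008)
ca.update("key-b", 0.6)
lo, hi = ca.query("key-a")   # interval containing 0.008
print(lo, hi)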