A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set.
The probability of a false positive – or false positive rate – of a Bloom filter is a function of the randomness of the values generated by the hash functions and of m, n, and k (n is the number of objects mapped into the Bloom filter).
Bloom filters do not store the items themselves and they use less space than the lower theoretical limit required to store the data correctly, and therefore, they exhibit an error rate. They have false positives but they do not have false negatives, and the one-sidedness of this error can be turned to our benefit.
A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. For example, checking availability of username is set membership problem, where the set is the list of all registered username.
Yes, a lossy hash table or a LRUCache is a data structure with fast O(1) lookup that will only give false negatives -- if you ask if "Have I run test X", it will tell you either "Yes, you definitely have", or "I can't remember".
Forgive the extremely crude pseudocode:
setup_test_table():
create test_table( some large number of entries )
clear each entry( test_table, NEVER )
return test_table
has_test_been_run_before( new_test_details, test_table ):
index = hash( test_details , test_table.length )
old_details = test_table[index].detail
// unconditionally overwrite old details with new details, LRU fashion.
// perhaps some other collision resolution technique might be better.
test_table[index].details = new_test_details
if ( old_details === test_details ) return YES
else if ( old_details === NEVER ) return NEVER
else return PERHAPS
main()
test_table = setup_test_table();
loop
test_details = generate_random_test()
status = has_test_been_run_before( test_details, test_table )
case status of
YES: do nothing;
NEVER: run test (test_details);
PERHAPS: if( rand()&1 ) run test (test_details);
next loop
end.
The exact data structure that accomplishes this task is a Direct-mapped cache, and is commonly used in CPUs.
function set_member(set, item)
set[hash(item) % set.length] = item
function is_member(set, item)
return set[hash(item) % set.length] == item
Is it possible to store the tests that you did not run? This should inverse the filter's behavior.
How about an LRUCache?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With