I need to store 4000 strings of fixed size (8 characters each) in C#, but I do not know which is best in terms of space used and the time needed to add and retrieve items: a Bloom filter, a hash table, or a Dictionary? Any help would be appreciated.
Bloom filters do not store the data items at all. They use a bit array, in which hash collisions are allowed.
A Bloom filter can be used to speed up answers in a key-value storage system. Values are stored on a disk, which has slow access times, while Bloom filter decisions are much faster. However, some unnecessary disk accesses are still made when the filter reports a positive, because a disk lookup is needed to weed out the false positives.
The Bloom filter [1] is an extensively used probabilistic data structure for membership filtering. A query runs in O(1) time for a fixed number of hash functions and needs only a small space overhead. The Bloom filter is used to improve query response time by avoiding some unnecessary searches.
A Bloom filter also includes a set of k hash functions with which we hash incoming values. These hash functions must all have a range of 0 to m − 1, where m is the number of bits in the array. When a value is added, each hash function maps it to an index in the bit array, and the Bloom filter sets the bit at that position to 1.
In this question, you really only have two data structures in C#, since Dictionaries in C# are implemented using hash tables. So we'll refer to Dictionary and Hashtable as both being hash tables. If you use one of them, then you probably want Dictionary due to type safety and performance as covered here: Why is Dictionary preferred over hashtable? But as a Dictionary is implemented using a hash table, it's not a huge difference either way.
But the real question is hash table (Dictionary) versus Bloom filter. Someone has previously asked the related question, What is the advantage to using bloom filters? They also link to the Wikipedia page on Bloom filters, which is quite informative: https://en.wikipedia.org/wiki/Bloom_filter The short version of the answer is that Bloom filters are smaller and faster. They do, however, have a cost associated with this: they are not completely accurate.
In a hash table, the original string is always stored for exact comparison. First you hash the value, and this tells you where in the table to look. Once you've looked in the table, you then check the value located there against the value you're searching for. In a Bloom filter, you use multiple hashes to calculate a set of bit positions. If there are 1's in all of those positions, then you consider the string to be found. This means that sometimes strings will be "found" which were never inserted (false positives). If the table is too small, in fact, you could reach a saturation point where nearly any string you tried would appear to be in the Bloom filter. Because you know how many strings you are going to be inserting, you can size the table appropriately to avoid this.
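To make the Bloom filter side of that comparison concrete, here is a minimal sketch in C#. It is only an illustration, not a tuned implementation: the class name `SimpleBloomFilter`, the FNV-1a-style string hash, and the trick of deriving the k indexes from two base hashes (double hashing) are all my own assumptions and are not taken from any particular library.

```csharp
using System;
using System.Collections;

// Minimal Bloom filter sketch (hypothetical class, for illustration only).
public sealed class SimpleBloomFilter
{
    private readonly BitArray _bits;  // the m-bit table
    private readonly int _hashCount;  // k, the number of hash functions

    public SimpleBloomFilter(int bitCount, int hashCount)
    {
        if (bitCount <= 0) throw new ArgumentOutOfRangeException(nameof(bitCount));
        if (hashCount <= 0) throw new ArgumentOutOfRangeException(nameof(hashCount));
        _bits = new BitArray(bitCount);
        _hashCount = hashCount;
    }

    // Sets the k bits selected by the item's hashes. The string itself is not stored.
    public void Add(string item)
    {
        uint h1 = Hash(item, 0u);
        uint h2 = Hash(item, 0x9E3779B9u) | 1u; // forced odd so it is never zero
        for (int i = 0; i < _hashCount; i++)
            _bits[IndexFor(h1, h2, i)] = true;
    }

    // false means "definitely never added"; true means "probably added".
    public bool MightContain(string item)
    {
        uint h1 = Hash(item, 0u);
        uint h2 = Hash(item, 0x9E3779B9u) | 1u;
        for (int i = 0; i < _hashCount; i++)
            if (!_bits[IndexFor(h1, h2, i)]) return false;
        return true;
    }

    // Assumption: the i-th hash is derived as h1 + i * h2 (double hashing),
    // then reduced to the range 0..m-1.
    private int IndexFor(uint h1, uint h2, int i)
    {
        ulong combined = h1 + (ulong)i * h2;
        return (int)(combined % (ulong)_bits.Length);
    }

    // FNV-1a-style hash over UTF-16 code units, perturbed by a seed.
    private static uint Hash(string s, uint seed)
    {
        unchecked
        {
            uint hash = 2166136261u ^ seed;
            foreach (char c in s)
            {
                hash ^= c;
                hash *= 16777619u;
            }
            return hash;
        }
    }
}
```

You would use it as `var filter = new SimpleBloomFilter(39261, 7); filter.Add("ABCD1234");` and then test with `filter.MightContain("ABCD1234")`. Anything that was added always comes back as true; anything that was not added usually comes back as false, but occasionally comes back as true.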
Let's look at the sizes involved. To make the numbers come out cleanly, I'm going to pretend that you have exactly 4096 strings. To have a relatively low-collision hash table, you would want your table to be at least as large as the number of strings. So, realistically (assuming 32-bit (4-byte) pointers), in this case, you'd be looking at a size of 4096*4 bytes = 16K for the table, plus 4096*(4+4+8) bytes = 64K for the list nodes (next pointer + string pointer) and the strings themselves. So, in total, probably about 80K, which probably isn't very much memory in most situations where you would be using C#. (This counts one byte per character; .NET strings are UTF-16 and carry per-object overhead, so the real figure would be somewhat higher.)
For Bloom filters, we have to decide the error rate we want to aim for in our size calculations. When we talk about a 1% error rate, it would mean that out of every 100 strings which were not inserted into the Bloom filter, 1 would be falsely indicated as being present. Strings which were inserted will always be correctly indicated as having been inserted. Using the equation m = -n*ln(p)/(ln(2)^2), we can calculate the minimum size to give us a certain error rate. In that equation, m is the number of slots (bits) in the table, p is the error rate, and n is the number of strings to be inserted. So, if we set p to be 0.01 (1% error), then we get approximately 9.6*4096 bits = 9.6*512 bytes = 4.8K, which is obviously quite a bit smaller. But, really, 1% is kind of high for an error rate. So, more realistically, we should probably go for something more like 0.0001% (p = 0.000001), which comes out to 28.8*4096 bits = 28.8*512 bytes = 14.4K. Obviously, either of those is substantially smaller than the 80K we calculated for the hash table. However, the hash table has an error rate of 0, which is clearly less than either 1% or 0.0001%.
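If you want to redo this arithmetic for other values of n or p, a small helper along these lines should do it. This is just a sketch: the class and method names are made up for illustration, `HashTableBytes` simply repeats the rough 32-bit estimate used for the hash table, and the formula for the number of hash functions, k = (m/n)*ln(2), is the standard companion to the sizing equation rather than something stated in the text.

```csharp
using System;

// Hypothetical helper that reproduces the size estimates.
public static class BloomSizing
{
    // m = -n * ln(p) / (ln 2)^2, rounded up to a whole number of bits.
    public static long BitsNeeded(long n, double p)
    {
        double ln2 = Math.Log(2);
        return (long)Math.Ceiling(-n * Math.Log(p) / (ln2 * ln2));
    }

    // Standard choice for the number of hash functions: k = (m / n) * ln 2.
    public static int OptimalHashCount(long m, long n)
    {
        return Math.Max(1, (int)Math.Round((double)m / n * Math.Log(2)));
    }

    // Rough chained hash table estimate with 4-byte pointers:
    // one table slot per string, plus a node holding a next pointer,
    // a string pointer, and 8 bytes of character data.
    public static long HashTableBytes(long n)
    {
        return n * 4 + n * (4 + 4 + 8);
    }
}
```

For n = 4096, `BitsNeeded(4096, 0.01) / 8.0 / 1024` comes out at about 4.8 and `BitsNeeded(4096, 0.000001) / 8.0 / 1024` at about 14.4, in line with the figures above, while `HashTableBytes(4096)` gives 81920 bytes (80K).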
So, really, it's up to you whether or not, in your situation, the trade-off of losing some accuracy to save a little space and a little time is worthwhile. Realistically, either option is likely to be small enough and fast enough for the vast majority of real-world situations.
A dictionary is an abstract data type that represents a mapping from one type to another. It doesn't specify what the implementation of the dictionary is: it could be backed by a hash table, a balanced binary search tree, a skip list, or one of many other structures. It's probably not appropriate here, because a dictionary associates elements of one type with elements of some other type, and you're not doing that. You're only concerned with storing the elements themselves.
A Bloom filter is a probabilistic data structure that is good for checking whether or not an element is definitely not in a set, but cannot tell you for sure whether something is in the set. It's commonly used in distributed systems to avoid unnecessary network reads. Each computer can store a Bloom filter of what entries might be in a database, and can filter out obviously unnecessary network calls by not querying a remote system if an entry is ruled out by the filter. It's not very good for what you're trying to do, since the false positives are probably a deal-breaker.
The hash table, though, is a great data structure for what you want. It supports fast lookups and insertions of elements and, with a good implementation, can be extremely memory efficient. However, it doesn't store the elements in sorted order, which may be a problem depending on your application.
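In C#, the hash table that stores just elements (no associated values) is `HashSet<T>`. Here is a minimal sketch of how the 8-character strings could be loaded into one; the `CodeStore` class and its length check are my own illustration, not something from the question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper showing exact membership with a hash set.
public static class CodeStore
{
    // Builds a set of fixed-size strings, rejecting anything that is not 8 characters.
    public static HashSet<string> Build(IEnumerable<string> codes)
    {
        // Ordinal comparison is the default for strings; it is spelled out here for clarity.
        var set = new HashSet<string>(StringComparer.Ordinal);
        foreach (string code in codes)
        {
            if (code.Length != 8)
                throw new ArgumentException("Expected 8 characters: '" + code + "'");
            set.Add(code); // returns false for a duplicate, which is ignored here
        }
        return set;
    }
}
```

After `var set = CodeStore.Build(new[] { "ABCD1234", "WXYZ9876" });`, a call to `set.Contains("ABCD1234")` returns true and `set.Contains("QQQQ0000")` returns false, with no false positives.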
If you do want sorted order, there are two other structures you might want to consider. The first would be a balanced binary search tree, which supports fast lookup and deletion and stores elements in sorted order. There are many good implementations out there; virtually all good programming languages ship with an implementation. The other is the trie, which supports very fast lookup and access and maintains sorted ordering. It can be a bit space-inefficient depending on the distribution of your strings, but might be exactly what you're looking for.
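For the sorted-order case in C#, `SortedSet<T>` is the built-in option (as far as I know it is backed by a red-black tree, which is a balanced binary search tree). The sketch below is only an illustration; the `SortedCodes` class and its `Range` method are hypothetical names. It shows the kind of query that sorted order buys you and a hash table cannot answer directly.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper around SortedSet<string>.
public static class SortedCodes
{
    // Builds a set that keeps its strings in ordinal (character code) order.
    public static SortedSet<string> Build(IEnumerable<string> codes)
    {
        return new SortedSet<string>(codes, StringComparer.Ordinal);
    }

    // Returns the stored strings between low and high (both inclusive), in sorted order.
    public static List<string> Range(SortedSet<string> set, string low, string high)
    {
        return new List<string>(set.GetViewBetween(low, high));
    }
}
```

For example, with a set built from "BBBB2222", "AAAA1111" and "CCCC3333", calling `SortedCodes.Range(set, "AAAA1111", "BBBB9999")` returns "AAAA1111" followed by "BBBB2222".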
Hope this helps!