I am currently storing a list of words (around 120,000) in a HashSet, for the purpose of using as a list to check enetered words against to see if they are spelt correctly, and just returning yes or no.
I was wondering if there is a way to do this which takes up less memory. Currently 120,000 words is around 12meg, the actual file the words are read from is around 900kb.
Any suggestions?
Thanks in advance
You could use a prefix tree or trie: http://en.wikipedia.org/wiki/Trie
Check out bloom filters or cuckoo hashing. Bloom filter or cuckoo hashing?
I am not sure if this is the answer for your question but worth looking into these alternatives. bloom filters are mainly used for spell checker kind of use cases.
HashSet
is probably not the right structure for this. Use Trie instead.
This might be a bit late but using Google you can easily find the DAWG investigation and C code that I posted a while ago.
http://www.pathcom.com/~vadco/dawg.html
TWL06 - 178,691 words - fits into 494,676 Bytes
The downside of a compressed-shared-node structure is that it does not work as a hash function for the words in your list. That is to say, it will tell you if a word exists, but it will not return an index to related data for a word that does exist.
If you want the perfect and complete hash functionality, in a processor-cache sized structure, you are going to have to read, understand, and modify a data structure called the ADTDAWG. It will be slightly larger than a traditional DAWG, but it is faster and more useful.
http://www.pathcom.com/~vadco/adtdawg.html
All the very best,
JohnPaul Adamovsky
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With