Need memory efficient way to store tons of strings (was: HAT-Trie implementation in java)

Tags:

I am working with a large set (5-20 million) of String keys (average length 10 chars) which I need to store in an in memory data structure that supports the following operation in constant time or near constant time:

// Returns true if the input is present in the container, false otherwise public boolean contains(String input)

Java's Hashmap is proving to be more than satisfactory as far as throughput is concerned but is taking up a lot of memory. I am looking for a solution that is memory efficient and still supports a throughput that is decent (comparable with or nearly as good as hashing).

I don't care about the insertion/deletion times. In my application, I will be performing only insertions (only at startup time) and will subsequently only be querying the data structure using the contains method for the life of the application.

I read that the HAT-Trie data structure is closest for my needs. I am wondering if there is a library that has an implementation.

Other suggestions with pointers to implementations welcome.

Thank You.

397

asked Feb 08 '10 00:02

hashable

2 Answers

The trie seems like a very good idea for your constraints.

A "thinking outside the box" alternative:

If you can afford some probability of answering "present" for a string that is absent

EDIT: if you can afford false positives, use a Bloom filter as suggested by WizardOfOdds in the comments.

For k=1, a Bloom filter is like a hash table without the keys: each "bucket" is simply a boolean that tells if at least one input with the same hash was present. If 1% false positives is acceptable, your hash table can be as small as about 100 * 20 million bits or roughly 200 MiB. For 1 in 1000 false positives, 2GiB.

Using several hash functions instead of one can improve the false positive rate for the same amount of bits.

114

answered Sep 19 '22 09:09

Pascal Cuoq

Google brings up a blog post on HAT tries in Java. But I don't see how this will solve your problem directly: the structure is a shallow trie over prefixes of the keys, with the leaves being hashtables holding the suffixes of all keys with the given prefix. So in total, you have a lot of hashtables storing all of the keys that are in your current one big hashtable (perhaps saving a few bytes per key overall because of the common prefixes). Either way, you need a more space-efficient hashtable than the default Java one, or the per-object overhead will hit you just as badly. So why not start with a specialized hashtable class for string keys only, if you take this route, and worry about the trie part only if it still seems worthwhile then?

answered Sep 21 '22 09:09

Darius Bacon

Related questions
                            
                                What parts of the JLS justify being able to throw checked exceptions as if they were unchecked?
                            
                                EnumMap or HashMap if lookup key is a String
                            
                                org.postgresql.util.PSQLException: The connection attempt failed
                            
                                Spring Data JPA NamedStoredProcedureQuery Multiple Out Parameters
                            
                                Centralised Java Logging
                            
                                Is iteration via Collections.synchronizedSet(...).forEach() guaranteed to be thread safe?
                            
                                Spring ThreadPoolTaskExecutor vs Java Executorservice cachedthreadpool [closed]
                            
                                Embedding resources (images, sound bits, etc) into a Java project then use those resources
                            
                                Compiler complains about "missing return statement" even though it is impossible to reach condition where return statement would be missing
                            
                                Default methods and interfaces extending other interfaces
                            
                                Slow application, frequent JVM hangs with single-CPU setups and Java 12+
                            
                                Java REST client API for Android [closed]
                            
                                Dealing with video (DVDs, .avi .mkv) in Java
                            
                                Docker: Combine multiple images
                            
                                Stateless Session Beans vs. Singleton Session Beans
                            
                                Is there an interactive interpreter for Java? [closed]
                            
                                Spring Data MongoDB: how to implement "entity relationships"?
                            
                                Does Stream.forEach respect the encounter order of sequential streams?
                            
                                Where to find Java 6 JSSE/JCE Source Code?
                            
                                How to get maven to timeout earlier while downloading dependencies?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Need memory efficient way to store tons of strings (was: HAT-Trie implementation in java)

Tags:

java

data-structures

hash

trie

bloom-filter

hashable

People also ask

2 Answers

Pascal Cuoq

Darius Bacon

Recent Activity

Donate For Us