Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient data structure that checks for existence of String

I am writing a program which will add a growing number or unique strings to a data structure. Once this is done, I later need to constantly check for existence of the string in it.

If I were to use an ArrayList I believe checking for the existence of some specified string would iterate through all items until a matching string is found (or reach the end and return false).

However, with a HashMap I know that in constant time I can simply use the key as a String and return any non-null object, making this operation faster. However, I am not keen on filling a HashMap where the value is completely arbitrary. Is there a readily available data structure that uses hash functions, but doesn't require a value to be placed?

like image 512
ddriver1 Avatar asked May 18 '14 12:05

ddriver1


3 Answers

It depends on many factors, including the number of strings you have to feed into that data structure (do you know the number by advance, or have a basic idea?), and what you expect the hit/miss ratio to be.

A very efficient data structure to use is a trie or a radix tree; they are basically made for that. For an explanation of how they work, see the wikipedia entry (a followup to the radix tree definition is in this page). There are Java implementations (one of them is here; however I have a fixed set of strings to inject, which is why I use a builder).

If your number of strings is really huge and you don't expect a minimal miss ratio then you might also consider using a bloom filter; the problem however is that it is probabilistic; but you can get very quick answers to "not there". Here also, there are implementations in Java (Guava has an implementation for instance).

Otherwise, well, a HashSet...

like image 85
fge Avatar answered Oct 18 '22 14:10

fge


If I were to use an ArrayList I believe checking for the existence of some specified string would iterate through all items until a matching string is found

Correct, checking a list for an item is linear in the number of entries of the list.

However, I am not keen on filling a HashMap where the value is completely arbitrary

You don't have to: Java provides a HashSet<T> class, which is very much like a HashMap without the value part.

You can put all your strings there, and then check for presence or absence of other strings in constant time;

Set<String> knownStrings = new HashSet<String>();
... // Fill the set with strings

if (knownString.contains(myString)) {
    ...
}
like image 25
Sergey Kalinichenko Avatar answered Oct 18 '22 14:10

Sergey Kalinichenko


A HashSet is probably the right answer, but if you choose (for simplicity, eg) to search a list it's probably more efficient to concatenate your words into a String with separators:

String wordList = "$word1$word2$word3$word4$...";

Then create a search argument with your word between the separators:

String searchArg = "$" + searchWord + "$";

Then search with, say, contains:

bool wordFound = wordList.contains(searchArg);

You can maybe make this a tiny bit more efficient by using StringBuilder to build the searchArg.

like image 31
Hot Licks Avatar answered Oct 18 '22 14:10

Hot Licks