Filtering out strings from a HashSet (or other collection)

Question

I am reading the contents of a txt file into a HashSet. The file contains almost every word in the English language, and each word becomes a String in the HashSet.

In my app, characters are being added to a String. I want to check whether this String is, or can become equal to, any of the Strings in the HashSet. That is, say the HashSet contains only the String apple. I have a String appl, and now I want to filter out the HashSet until it becomes a set with only Strings that start with appl (in this case a set with only apple).

I can iterate over the entire HashSet and use the startsWith(String) method, as I build a new filtered HashSet. But my initial HashSet is very big, so my question is: Is there a more efficient way to do this (perhaps using a different type of Collection?)

Some code of how I would do it right now:

private HashSet<String> filter(String partOfWord){
    HashSet<String> filteredSet = new HashSet<>();

    for (String word : dictionary) { // dictionary is the full HashSet
        if (word.startsWith(partOfWord)) {
            filteredSet.add(word);
        }
    }
    return filteredSet;
}

Bohemian · Accepted Answer

A trie is the ultimate weapon of doom for this task, but you can get good efficiency out of a TreeSet:

private TreeSet<String> dictionary;

private TreeSet<String> filter(String partOfWord) {
    return (TreeSet<String>)dictionary.subSet(partOfWord, partOfWord + "zzz");
}

Everything that start with "appl" is also between "appl" (inclusive if it's a word itself) and "applzzz" (no English word has 3 consecutive "z"'s in it), which is lexicographically greater than all words that start with "appl". The time complexity of the call to subset() is O(log n) to find the start of the subset and O(m) (m = number returned) for the range, which is pretty good.

Note that if you are able to reuse the returned set as your new dictionary as your word grows, you will have much more efficient code overall.

The cast to TreeSet<String> is needed because subSet() is a method of the SortedSet interface and returns a SortedSet, but it's covariant because the TreeSet implementation returns a view (another efficiency benefit), which is of course another TreeSet.

For improved efficiency, but uglier code, you could use a sorted String[] and Arrays.binarySearch(), then once you located your hit, you could quickly iterate along the array collection your hits.

Note that both the TreeSet and sorted array have O(log n) look-up time, whereas a HashSet (although unsuitable for the task) is O(1) look up time.

Filtering out strings from a HashSet (or other collection)

Tags:

java

string

Chronicle

1 Answers

Bohemian

Recent Activity

Donate For Us

Filtering out strings from a HashSet (or other collection)

Tags:

java

string

Chronicle

1 Answers

Bohemian

Related questions

Recent Activity

Donate For Us