 

Design Add and Search Words Data Structure: Leetcode 211

I am currently trying to solve the problem Add and Search Words Data Structure on leetcode. The question is as follows:

Design a data structure that supports adding new words and finding if a string matches any previously added string.

Implement the WordDictionary class:

WordDictionary() Initializes the object.

void addWord(word) Adds word to the data structure, it can be matched later.

bool search(word) Returns true if there is any string in the data structure that matches word or false otherwise. word may contain dots . where dots can be matched with any letter.

My Strategy:

My strategy is to represent a trie with nested hashmaps instead of a traditional linked-list-based tree structure, aiming for better performance and lower complexity. By using a hashmap, we can jump directly to the next node without traversing unnecessary nodes, making operations faster, especially with large datasets.

For example, when inserting words like apple and application, the structure is organized as nested hashmaps where each character in a word points to another hashmap representing the next character. The end of a word is marked with a special key-value pair {'end': {}}. This way, we can store and search for words efficiently.
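
For illustration, here is roughly what that nested structure looks like after adding "app" and "apple" (my own shorter example, not part of the submitted code); the {'end': {}} entries mark complete words:

# Sketch of the nested-hashmap trie described above,
# after addWord("app") and addWord("apple").
trie = {
    'a': {
        'p': {
            'p': {
                'end': {},              # "app" ends here
                'l': {
                    'e': {'end': {}},   # "apple" ends here
                },
            },
        },
    },
}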

My Code:

class WordDictionary(object):

    def __init__(self):
        self.map = {}

    def addWord(self, word):
        """
        :type word: str
        :rtype: None
        """
        current = self.map
        for i in word:
            if i in current:
                current = current[i]
            else:
                current[i] = {}
                current = current[i]
        current['end'] = {}
        
        return
        

    def search(self, word):
        """
        :type word: str
        :rtype: bool
        """
        current = self.map
        for i in word:
            if i in current:
                current = current[i]
            elif i == '.':
                current = {key: value for d in current.values() for key, value in d.items()}
            else:
                return False
        if 'end' in current:
            return True
        return False

The solution works for the majority of cases, but I've hit an error with test case 16, where it doesn't give the right outcome. The length of test case 16 makes it particularly hard to pinpoint where the mistake is happening. I need some guidance to track down and fix this logical error. Could you lend a hand in sorting this out?

asked Nov 18 '25 by Kanchon Gharami


2 Answers

Design a data structure

That's an interesting structure you have implemented. Sorry, I don't know how it ran afoul of test 16.

        current['end'] = {}

This costs N dict allocations if you store N words.

Consider instead simply storing a pointer to the same old singleton, so we malloc() a little less:

        current['end'] = None

permutations

Sorted data structures tend to be good for wildcard matching. The trouble is that search('a.ple') might force us to scan all initial-"a" entries, with "a" being a popular letter. Or worse, search('.pple') might force an O(N) linear scan of everything. Assume we don't know etaoin shrdlu letter frequencies, and cannot measure them in a contest setting.

One way out of this is to store all rotations (cyclic permutations) of each word:

from sortedcontainers import SortedDict

def add(d: SortedDict, k: str, v: str) -> None:
    if k not in d:
        # The set deals with the ambiguity of anagrams,
        # e.g. {eat, ate, tea}
        d[k] = set()
    d[k].add(v)

d = SortedDict()
add(d, 'apple', 'apple')
add(d, 'pplea', 'apple')
add(d, 'pleap', 'apple')
add(d, 'leapp', 'apple')
add(d, 'eappl', 'apple')

Now when we're confronted with one or two wildcard characters in a search() query, it's a matter of identifying the longest stretch of definite letters. So for ".pp.e" we would scan d.irange("pp", "pq") looking for keys with an "e" in the correct position. For a query like "a..le" we would scan d.irange("lea", "leb").

You can't pip install packages for the contest, but you can certainly implement enough binary search code to support this.
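
Here is a rough sketch of that irange-based lookup, assuming every rotation of each stored word is a key in the SortedDict; add_rotations and wildcard_search are names I made up for illustration, and for simplicity it only uses the longest non-wrapping run of literal letters (the "lea" example above also wraps around the end of the pattern):

from sortedcontainers import SortedDict

def add_rotations(d: SortedDict, word: str) -> None:
    # Store every rotation of the word, each mapping back to the original word.
    for s in range(len(word)):
        rotation = word[s:] + word[:s]
        d.setdefault(rotation, set()).add(word)

def wildcard_search(d: SortedDict, pattern: str) -> bool:
    # Find the longest run of literal (non-'.') characters in the pattern.
    best_start, best_len, run_start, run_len = 0, 0, 0, 0
    for i, ch in enumerate(pattern + '.'):      # trailing '.' ends the last run
        if ch != '.':
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0
    # Rotate the pattern so the literal run becomes a prefix, e.g. ".pp.e" -> "pp.e.".
    rotated = pattern[best_start:] + pattern[:best_start]
    prefix = rotated[:best_len]
    # Scan only keys sharing that prefix, then verify the wildcard positions.
    for key in d.irange(minimum=prefix):
        if not key.startswith(prefix):
            break
        if len(key) == len(rotated) and all(p == '.' or p == k
                                            for p, k in zip(rotated, key)):
            return True
    return False

d = SortedDict()
add_rotations(d, 'apple')
print(wildcard_search(d, '.pp.e'))   # True
print(wildcard_search(d, 'a..le'))   # True
print(wildcard_search(d, 'b.ple'))   # False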

anagrams

Let's lean on scrambled letter order a bit harder.

Suppose we want to find anagrams of a query word. The standard solution is to sort letters and store that. So ''.join(sorted('apple')) -> 'aelpp'

A catastrophic query for this scheme would be '.ppl.'. That seems to send us back to storage proportional to word length. Let's try storing a second copy, the reverse sort: 'pplea'. Now we can zip() up these two iterators:

  • d.irange('lpp', 'lpq')
  • d.irange('ppl', 'ppm')

Whichever one produces a match first will win, terminating the loop. This scheme has a nice symmetry: when adding a word and when querying we can always alphabetize however many letters we've been given.
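
As a small sanity check of that symmetry (helper names here are mine, just for illustration), the same alphabetizing produces both the stored keys and the irange bounds above:

def sorted_keys(word):
    # Forward- and reverse-sorted letters of a stored word.
    forward = ''.join(sorted(word))                   # 'apple' -> 'aelpp'
    backward = ''.join(sorted(word, reverse=True))    # 'apple' -> 'pplea'
    return forward, backward

def query_keys(pattern):
    # Alphabetize only the definite (non-wildcard) letters of a query.
    letters = [c for c in pattern if c != '.']        # '.ppl.' -> ['p', 'p', 'l']
    return ''.join(sorted(letters)), ''.join(sorted(letters, reverse=True))

print(sorted_keys('apple'))    # ('aelpp', 'pplea')
print(query_keys('.ppl.'))     # ('lpp', 'ppl') -- the prefixes scanned above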

Consider an example that lacks repeated characters.

  • 'fruit' --> 'firtu'
  • 'fruit' --> 'utrif'

A catastrophic query for this scheme would be '.r.it'. So maybe add a third entry, a permutation starting from the middle (e.g. 'rtufi'), and call it a day? Choosing three entries exploits the fact that the adversary is limited to just two wildcards.

I have been trying to avoid storing a number of entries that is quadratic in the word length, since the maximum word length is 15, and 15 × 14 = 210 is a large number. Maybe storing entries for all possible single-wildcard queries is the right tradeoff here.

For long words, a wildcard knocking out 'm' or 'n' seems uninteresting; only the first few letters close to 'a' and the last few letters close to 'z' have much power to wreck a query. If e.g. 'b' and 'c' are knocked out by wildcards, then the reverse copy will save us. Similarly, if 's' and 'u' are knocked out, the forward copy is sure to permit efficient querying. So the number of entries can grow sub-linearly with word size.

letter count

Oh, one more thing that is known both when adding and when querying: the letter count. So the first index of our dictionary should be word_len, followed by sorted_letters or whatever:

d[5]['firtu'].add('fruit')
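
A minimal sketch of that two-level index, using collections.defaultdict (add_word is just an illustrative name):

from collections import defaultdict

# First level: word length; second level: sorted letters; values: the words.
index = defaultdict(lambda: defaultdict(set))

def add_word(word):
    index[len(word)][''.join(sorted(word))].add(word)

add_word('fruit')
print(index[5]['firtu'])   # {'fruit'}
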
answered Nov 21 '25 by J_H


One thing searching a trie shares with binary search is that it keeps shrinking the possibilities within a range; except that instead of halving, it narrows character by character, and instead of dividing at the middle, it uses the position in the word. (It still performs in roughly log n steps for i.i.d. input; see Tong2016Smoothed.)

However, when it gets to a wildcard (.), one can't simply widen the range and cover all the possibilities at once: in general the matches will have holes, so they are not representable as a single range. For example, when searching for c.t in {cat, cel, cut}, cat and cut are included but cel, in the middle, is not. I think I've modified your code below to do this by taking multiple paths.

from typing import Iterator

class WordDictionary(object):

    def __init__(self):
        self.node = None
        self.map = {}

    def __repr__(self):
        if self.node and self.map:
            return "\"{0}\".{1}".format(self.node, self.map)
        elif self.node:
            return "\"{0}\"".format(self.node)
        elif self.map:
            return "∅.{0}".format(self.map)
        else:
            return "<error>"

    def addWord(self, word: str) -> None:
        current = self
        for i in word:
            if i not in current.map:
                current.map[i] = WordDictionary()
            current = current.map[i]
        current.node = word

    def match(self, word: str) -> Iterator[str]:
        if not word:
            """
            in-order sub-trie traversal will enumerate the words that have
            `word` as a prefix (hence prefix-tree); (not used)
            """
            if self.node:
                yield self.node
            return
        i = word[0]
        rest = word[1:]
        if i == '.':
            for wild in self.map.values():
                yield from wild.match(rest)
        elif i in self.map:
            yield from self.map[i].match(rest)

    def search(self, word: str) -> bool:
        return next(self.match(word), None) is not None

words = WordDictionary()
words.addWord("zulu")
words.addWord("november")
words.addWord("mike")
words.addWord("yankee")
words.addWord("cat")
words.addWord("cate")
words.addWord("cel")
words.addWord("cut")
words.addWord("cute")
print("words: {0}".format(words))

print("first match a:", next(words.match("a"), None))
print("first match mi..:", next(words.match("mi.."), None))
print("search a:", words.search("a"))
print("search c.t:", words.search("c.t"))
print("search z.lu:", words.search("z.lu"))
print("search mi..:", words.search("mi.."))
print("search ..:", words.search(".."))
print("search ....:", words.search("...."))
print("search c.te", words.search("c.te"))

The match function is a (recursive) generator over all the words that could match the given string with wildcards. The self.node string is a more expressive form of the end sentinel; end could be used instead, but the way I've written search requires that actual words are yielded from match. (Sorry for the python3 instead of python, this is not my usual language at all.)

answered Nov 21 '25 by Neil