Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to break down a given text into words from the dictionary?

This is an interview question. Suppose you have a string text and a dictionary (a set of strings). How do you break down text into substrings such that each substring is found in the dictionary.

For example you can break down "thisisatext" into ["this", "is", "a", "text"] using /usr/share/dict/words.

I believe backtracking can solve this problem (in pseudo-Java):

void solve(String s, Set<String> dict, List<String> solution) {
   if (s.length == 0)
      return
   for each prefix of s found in dict
      solve(s without prefix, dict, solution + prefix)
}

List<String> solution = new List<String>()
solve(text, dict, solution)

Does it make sense? Would you optimize the step of searching the prefixes in the dictionary? What data structures would you recommend?

like image 941
Michael Avatar asked Jan 09 '12 18:01

Michael


3 Answers

This solution assumes the existence of Trie data structure for the dictionary. Further, for each node in Trie, assumes the following functions:

  1. node.IsWord() : Returns true if the path to that node is a word
  2. node.IsChild(char x): Returns true if there exists a child with label x
  3. node.GetChild(char x): Returns the child node with label x
Function annotate( String str, int start, int end, int root[], TrieNode node):
i = start
while i<=end:
    if node.IsChild ( str[i]):
        node = node.GetChild( str[i] )
        if node.IsWord():
            root[i+1] = start
        i+=1
    else:
        break;

end = len(str)-1
root = [-1 for i in range(len(str)+1)]
for start= 0:end:
    if start = 0 or root[start]>=0:
        annotate(str, start, end, root, trieRoot)

index  0  1  2  3  4  5  6  7  8  9  10  11
str:   t  h  i  s  i  s  a  t  e  x  t
root: -1 -1 -1 -1  0 -1  4  6 -1  6 -1   7

I will leave the part for you to list the words that make up the string by reverse traversing the root.

Time complexity is O(nk) where n is the length of the string and k is the length of the longest word in the dictionary.

PS: I am assuming following words in the dictionary: this, is, a, text, ate.

like image 197
ElKamina Avatar answered Oct 15 '22 17:10

ElKamina


There is a very thorough writeup for the solution to this problem in this blog post.

The basic idea is just to memoize the function you've written and you'll have an O(n^2) time, O(n) space algorithm.

like image 41
Rob Neuhaus Avatar answered Oct 15 '22 19:10

Rob Neuhaus


Approach 1- Trie looks to be a close fit here. Generate trie of the words in english dictionary. This trie building is one time cost. After trie is built then your string can be easily compared letter by letter. if at any point you encounter a leaf in the trie you can assume you found a word, add this to a list & move on with your traversal. Do the traversal till you have reached the end of your string. The list is output.

Time Complexity for search - O(word_length).

Space Complexity - O(charsize * word_length * no_words). Size of your dictionary.

Approach 2 - I have heard of Suffix Trees, never used them but it might be useful here.

Approach 3 - is more pedantic & a lousy alternative. you have already suggested this.

You could try the other way around. Run through the dict is check for sub-string match. Here I am assuming the keys in dict are the words of the english dictionary /usr/share/dict/words. So psuedo code looks something like this -

(list) splitIntoWords(String str, dict d)
{
    words = []
    for (word in d)
    {
        if word in str
            words.append(word);
    }
    return words;
}

Complexity - O(n) running through entire dict + O(1) for substring match.

Space - worst case O(n) if len(words) == len(dict)

As others have pointed out, this does require backtracking.

like image 4
Srikar Appalaraju Avatar answered Oct 15 '22 17:10

Srikar Appalaraju