I recently came across the following interview question:
Given an input string and a dictionary of words, implement a method that breaks up the input string into a space-separated string of dictionary words that a search engine might use for "Did you mean?" For example, an input of "applepie" should yield an output of "apple pie".
I can't seem to get an optimal solution as far as complexity is concerned. Does anyone have any suggestions on how to do this efficiently?
To convert a string in a list of words, you just need to split it on whitespace. You can use split() from the string class. The default delimiter for this method is whitespace, i.e., when called on a string, it'll split that string at whitespace characters.
Description. Python string method split() returns a list of all the words in the string, using str as the separator (splits on all whitespace if left unspecified), optionally limiting the number of splits to num.
Looks like the question is exactly my interview problem, down to the example I used in the post at The Noisy Channel. Glad you liked the solution. Am quite sure you can't beat the O(n^2) dynamic programming / memoization solution I describe for worst-case performance.
You can do better in practice if your dictionary and input aren't pathological. For example, if you can identify in linear time the substrings of the input string are in the dictionary (e.g., with a trie) and if the number of such substrings is constant, then the overall time will be linear. Of course, that's a lot of assumptions, but real data is often much nicer than a pathological worst case.
There are also fun variations of the problem to make it harder, such as enumerating all valid segmentations, outputting a best segmentation based on some definition of best, handling a dictionary too large to fit in memory, and handling inexact segmentations (e.g., correcting spelling mistakes). Feel free to comment on my blog or otherwise contact me to follow up.
This link describes this problem as a perfect interview question and provides several methods to solve it. Essentially it involves recursive backtracking. At this level it would produce an O(2^n) complexity. An efficient solution using memoization could bring this problem down to O(n^2).
Using python, we can write two function, the first one segment
returns the first segmentation of a piece of contiguous text into words given a dictionary or None
if no such segmentation is found. Another function segment_all
returns a list of all segmentations found. Worst case complexity is O(n**2) where n is the input string length in chars.
The solution presented here can be extended to include spelling corrections and bigram analysis to determine the most likely segmentation.
def memo(func):
'''
Applies simple memoization to a function
'''
cache = {}
def closure(*args):
if args in cache:
v = cache[args]
else:
v = func(*args)
cache[args] = v
return v
return closure
def segment(text, words):
'''
Return the first match that is the segmentation of 'text' into words
'''
@memo
def _segment(text):
if text in words: return text
for i in xrange(1, len(text)):
prefix, suffix = text[:i], text[i:]
segmented_suffix = _segment(suffix)
if prefix in words and segmented_suffix:
return '%s %s' % (prefix, segmented_suffix)
return None
return _segment(text)
def segment_all(text, words):
'''
Return a full list of matches that are the segmentation of 'text' into words
'''
@memo
def _segment(text):
matches = []
if text in words:
matches.append(text)
for i in xrange(1, len(text)):
prefix, suffix = text[:i], text[i:]
segmented_suffix_matches = _segment(suffix)
if prefix in words and len(segmented_suffix_matches):
for match in segmented_suffix_matches:
matches.append('%s %s' % (prefix, match))
return matches
return _segment(text)
if __name__ == "__main__":
string = 'cargocultscience'
words = set('car cargo go cult science'.split())
print segment(string, words)
# >>> car go cult science
print segment_all(string, words)
# >>> ['car go cult science', 'cargo cult science']
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With