Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Breaking a string apart into a sequence of words

I recently came across the following interview question:

Given an input string and a dictionary of words, implement a method that breaks up the input string into a space-separated string of dictionary words that a search engine might use for "Did you mean?" For example, an input of "applepie" should yield an output of "apple pie".

I can't seem to get an optimal solution as far as complexity is concerned. Does anyone have any suggestions on how to do this efficiently?

like image 619
grandmaster Avatar asked Aug 25 '11 19:08

grandmaster


People also ask

How do I split a string into a list of words?

To convert a string in a list of words, you just need to split it on whitespace. You can use split() from the string class. The default delimiter for this method is whitespace, i.e., when called on a string, it'll split that string at whitespace characters.

What does the split () method return from a list of words?

Description. Python string method split() returns a list of all the words in the string, using str as the separator (splits on all whitespace if left unspecified), optionally limiting the number of splits to num.


3 Answers

Looks like the question is exactly my interview problem, down to the example I used in the post at The Noisy Channel. Glad you liked the solution. Am quite sure you can't beat the O(n^2) dynamic programming / memoization solution I describe for worst-case performance.

You can do better in practice if your dictionary and input aren't pathological. For example, if you can identify in linear time the substrings of the input string are in the dictionary (e.g., with a trie) and if the number of such substrings is constant, then the overall time will be linear. Of course, that's a lot of assumptions, but real data is often much nicer than a pathological worst case.

There are also fun variations of the problem to make it harder, such as enumerating all valid segmentations, outputting a best segmentation based on some definition of best, handling a dictionary too large to fit in memory, and handling inexact segmentations (e.g., correcting spelling mistakes). Feel free to comment on my blog or otherwise contact me to follow up.

like image 119
Daniel Tunkelang Avatar answered Oct 25 '22 05:10

Daniel Tunkelang


This link describes this problem as a perfect interview question and provides several methods to solve it. Essentially it involves recursive backtracking. At this level it would produce an O(2^n) complexity. An efficient solution using memoization could bring this problem down to O(n^2).

like image 33
jmnwong Avatar answered Oct 25 '22 04:10

jmnwong


Using python, we can write two function, the first one segment returns the first segmentation of a piece of contiguous text into words given a dictionary or None if no such segmentation is found. Another function segment_all returns a list of all segmentations found. Worst case complexity is O(n**2) where n is the input string length in chars.

The solution presented here can be extended to include spelling corrections and bigram analysis to determine the most likely segmentation.

def memo(func):
    '''
    Applies simple memoization to a function
    '''
    cache = {}
    def closure(*args):
        if args in cache:
            v = cache[args]
        else:
            v = func(*args)
            cache[args] = v
        return v
    return closure


def segment(text, words):
    '''
    Return the first match that is the segmentation of 'text' into words
    '''
    @memo
    def _segment(text):
        if text in words: return text
        for i in xrange(1, len(text)):
            prefix, suffix = text[:i], text[i:]
            segmented_suffix = _segment(suffix)
            if prefix in words and segmented_suffix:
                return '%s %s' % (prefix, segmented_suffix)
        return None
    return _segment(text)


def segment_all(text, words):
    '''
    Return a full list of matches that are the segmentation of 'text' into words
    '''
    @memo
    def _segment(text):
        matches = []
        if text in words: 
            matches.append(text)
        for i in xrange(1, len(text)):
            prefix, suffix = text[:i], text[i:]
            segmented_suffix_matches = _segment(suffix)
            if prefix in words and len(segmented_suffix_matches):
                for match in segmented_suffix_matches:
                    matches.append('%s %s' % (prefix, match))
        return matches 
    return _segment(text)


if __name__ == "__main__":    
    string = 'cargocultscience'
    words = set('car cargo go cult science'.split())
    print segment(string, words)
    # >>> car go cult science
    print segment_all(string, words)
    # >>> ['car go cult science', 'cargo cult science']
like image 36
Rabih Kodeih Avatar answered Oct 25 '22 04:10

Rabih Kodeih