Breaking a string apart into a sequence of words

Tags:

I recently came across the following interview question:

Given an input string and a dictionary of words, implement a method that breaks up the input string into a space-separated string of dictionary words that a search engine might use for "Did you mean?" For example, an input of "applepie" should yield an output of "apple pie".

I can't seem to get an optimal solution as far as complexity is concerned. Does anyone have any suggestions on how to do this efficiently?

619

asked Aug 25 '11 19:08

grandmaster

3 Answers

Looks like the question is exactly my interview problem, down to the example I used in the post at The Noisy Channel. Glad you liked the solution. Am quite sure you can't beat the O(n^2) dynamic programming / memoization solution I describe for worst-case performance.

You can do better in practice if your dictionary and input aren't pathological. For example, if you can identify in linear time the substrings of the input string are in the dictionary (e.g., with a trie) and if the number of such substrings is constant, then the overall time will be linear. Of course, that's a lot of assumptions, but real data is often much nicer than a pathological worst case.

There are also fun variations of the problem to make it harder, such as enumerating all valid segmentations, outputting a best segmentation based on some definition of best, handling a dictionary too large to fit in memory, and handling inexact segmentations (e.g., correcting spelling mistakes). Feel free to comment on my blog or otherwise contact me to follow up.

119

answered Oct 25 '22 05:10

Daniel Tunkelang

This link describes this problem as a perfect interview question and provides several methods to solve it. Essentially it involves recursive backtracking. At this level it would produce an O(2^n) complexity. An efficient solution using memoization could bring this problem down to O(n^2).

answered Oct 25 '22 04:10

jmnwong

Using python, we can write two function, the first one segment returns the first segmentation of a piece of contiguous text into words given a dictionary or None if no such segmentation is found. Another function segment_all returns a list of all segmentations found. Worst case complexity is O(n**2) where n is the input string length in chars.

The solution presented here can be extended to include spelling corrections and bigram analysis to determine the most likely segmentation.

def memo(func):
    '''
    Applies simple memoization to a function
    '''
    cache = {}
    def closure(*args):
        if args in cache:
            v = cache[args]
        else:
            v = func(*args)
            cache[args] = v
        return v
    return closure


def segment(text, words):
    '''
    Return the first match that is the segmentation of 'text' into words
    '''
    @memo
    def _segment(text):
        if text in words: return text
        for i in xrange(1, len(text)):
            prefix, suffix = text[:i], text[i:]
            segmented_suffix = _segment(suffix)
            if prefix in words and segmented_suffix:
                return '%s %s' % (prefix, segmented_suffix)
        return None
    return _segment(text)


def segment_all(text, words):
    '''
    Return a full list of matches that are the segmentation of 'text' into words
    '''
    @memo
    def _segment(text):
        matches = []
        if text in words: 
            matches.append(text)
        for i in xrange(1, len(text)):
            prefix, suffix = text[:i], text[i:]
            segmented_suffix_matches = _segment(suffix)
            if prefix in words and len(segmented_suffix_matches):
                for match in segmented_suffix_matches:
                    matches.append('%s %s' % (prefix, match))
        return matches 
    return _segment(text)


if __name__ == "__main__":    
    string = 'cargocultscience'
    words = set('car cargo go cult science'.split())
    print segment(string, words)
    # >>> car go cult science
    print segment_all(string, words)
    # >>> ['car go cult science', 'cargo cult science']

answered Oct 25 '22 04:10

Rabih Kodeih

Related questions
                            
                                How to write a custom pretty printer
                            
                                Detect a specific frequency/tone from raw wave-data
                            
                                Algorithm to reach a number in a fixed amount of steps using addition, division and multiplication only
                            
                                Hindley-Milner algorithm in Java
                            
                                Finding repeating signed integers with O(n) in time and O(1) in space
                            
                                Using iterative style to clone an object in JavaScript
                            
                                Efficiently merging and re-sorting sorted lists
                            
                                Shortest Root to Leaf Path
                            
                                LP Simplex algorithm in C++ [closed]
                            
                                crossing edges in the travelling salesman problem
                            
                                Find the nearest dot in a 2D space
                            
                                Is it possible to do an inplace merge without temporary storage?
                            
                                Graph Theory: Calculating Clustering Coefficient
                            
                                Data for simple TSP [closed]
                            
                                Gray code increment function
                            
                                Efficient algorithm for comparing XML nodes
                            
                                Does dijkstras algorithm relax the edges of the shortest path in order?
                            
                                C# Diff Algorithm for Text [closed]
                            
                                Given a string of red and blue balls, find min number of swaps to club the colors together
                            
                                How to sort an integer array into negative, zero, positive part without changing relative position?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Breaking a string apart into a sequence of words

Tags:

algorithm

complexity-theory

search

search-engine