Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Word segmentation using dynamic programming

So first off I'm very new to Python so if I'm doing something awful I'm prefacing this post with a sorry. I've been assigned this problem:

We want to devise a dynamic programming solution to the following problem: there is a string of characters which might have been a sequence of words with all the spaces removed, and we want to find a way, if any, in which to insert spaces that separate valid English words. For example, theyouthevent could be from “the you the vent”, “the youth event” or “they out he vent”. If the input is theeaglehaslande, then there’s no such way. Your task is to implement a dynamic programming solution in two separate ways:

  • iterative bottom-up version
  • recursive memorized version

Assume that the original sequence of words had no other punctuation (such as periods), no capital letters, and no proper names - all the words will be available in a dictionary file that will be provided to you.

So I'm having two main issues:

  1. I know that this can and should be done in O(N^2) and I don't think mine is
  2. The lookup table isn't adding all the words it seems such that it can reduce the time complexity

What I'd like:

  1. Any kind of input (better way to do it, something you see wrong in the code, how I can get the lookup table working, how to use the table of booleans to build a sequence of valid words)
  2. Some idea on how to tackle the recursive version although I feel once I am able to solve the iterative solution I will be able to engineer the recursive one from it.

As always thanks for any time and or effort anyone gives this, it is always appreciated.

Here's my attempt:

#dictionary function returns True if word is found in dictionary false otherwise
def dictW(s):
    diction = open("diction10k.txt",'r') 
    for x in diction:
        x = x.strip("\n \r")
        if s == x:
            return True
    return False

def iterativeSplit(s):
    n = len(s)
    i = j = k = 0
    A = [-1] * n
    word = [""] * n
    booly = False
    for i in range(0, n):
        for j in range(0, i+1):
            prefix = s[j:i+1]
            for k in range(0, n):

                if word[k] == prefix:
                    #booly = True
                    A[k] = 1
                    #print "Array below at index k %d and word = %s"%(k,word[k])
                    #print A
            # print prefix, A[i]
            if(((A[i] == -1) or (A[i] == 0))):
                if (dictW(prefix)):
                    A[i] = 1
                    word[i] = prefix
                    #print word[i], i
                else:
                    A[i] = 0
    for i in range(0, n):
        print A[i]
like image 232
xe0 Avatar asked Mar 05 '14 09:03

xe0


1 Answers

For another real-world example of how to do English word segmentation, look at the source of the Python wordsegment module. It's a little more sophisticated because it uses word and phrase frequency tables but it illustrates the memoization approach.

In particular, segment illustrates the memoization approach:

def segment(text):
    "Return a list of words that is the best segmenation of `text`."

    memo = dict()

    def search(text, prev='<s>'):
        if text == '':
            return 0.0, []

        def candidates():
            for prefix, suffix in divide(text):
                prefix_score = log10(score(prefix, prev))

                pair = (suffix, prefix)
                if pair not in memo:
                    memo[pair] = search(suffix, prefix)
                suffix_score, suffix_words = memo[pair]

                yield (prefix_score + suffix_score, [prefix] + suffix_words)

        return max(candidates())

    result_score, result_words = search(clean(text))

    return result_words

If you replaced the score function so that it returned "1" for a word in your dictionary and "0" if not then you would simply enumerate all positively scored candidates for your answer.

like image 147
GrantJ Avatar answered Sep 29 '22 09:09

GrantJ