So first off I'm very new to Python so if I'm doing something awful I'm prefacing this post with a sorry. I've been assigned this problem:
We want to devise a dynamic programming solution to the following problem: there is a string of characters which might have been a sequence of words with all the spaces removed, and we want to find a way, if any, in which to insert spaces that separate valid English words. For example, theyouthevent could be from “the you the vent”, “the youth event” or “they out he vent”. If the input is theeaglehaslande, then there’s no such way. Your task is to implement a dynamic programming solution in two separate ways:
Assume that the original sequence of words had no other punctuation (such as periods), no capital letters, and no proper names - all the words will be available in a dictionary file that will be provided to you.
So I'm having two main issues:
What I'd like:
As always thanks for any time and or effort anyone gives this, it is always appreciated.
Here's my attempt:
#dictionary function returns True if word is found in dictionary false otherwise
def dictW(s):
diction = open("diction10k.txt",'r')
for x in diction:
x = x.strip("\n \r")
if s == x:
return True
return False
def iterativeSplit(s):
n = len(s)
i = j = k = 0
A = [-1] * n
word = [""] * n
booly = False
for i in range(0, n):
for j in range(0, i+1):
prefix = s[j:i+1]
for k in range(0, n):
if word[k] == prefix:
#booly = True
A[k] = 1
#print "Array below at index k %d and word = %s"%(k,word[k])
#print A
# print prefix, A[i]
if(((A[i] == -1) or (A[i] == 0))):
if (dictW(prefix)):
A[i] = 1
word[i] = prefix
#print word[i], i
else:
A[i] = 0
for i in range(0, n):
print A[i]
For another real-world example of how to do English word segmentation, look at the source of the Python wordsegment module. It's a little more sophisticated because it uses word and phrase frequency tables but it illustrates the memoization approach.
In particular, segment
illustrates the memoization approach:
def segment(text):
"Return a list of words that is the best segmenation of `text`."
memo = dict()
def search(text, prev='<s>'):
if text == '':
return 0.0, []
def candidates():
for prefix, suffix in divide(text):
prefix_score = log10(score(prefix, prev))
pair = (suffix, prefix)
if pair not in memo:
memo[pair] = search(suffix, prefix)
suffix_score, suffix_words = memo[pair]
yield (prefix_score + suffix_score, [prefix] + suffix_words)
return max(candidates())
result_score, result_words = search(clean(text))
return result_words
If you replaced the score
function so that it returned "1" for a word in your dictionary and "0" if not then you would simply enumerate all positively scored candidates for your answer.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With