Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

String parsing using Python?

Given a string such as 'helloyellowellow', parse all the valid strings from the given string. (Eg: [[hell,hello,yellow],[low, low]........]

I am looking for the most optimized way to write the code. Here is mine but I am not sure if this is the best way.

Full disclosure - This was an interview question

master = []

#   Dictionary for us to look up words   
def is_word(inputstr):
    #returns True/False


def processstring(fstr,secstr,li):
    if is_word(fstr): 
        li.append(fstr)
    if len(secstr) == 0:
        if len(li) != 0:
            master.append(li)
        return
    processstring(fstr+secstr[0], secstr[1:len(secstr)],li)



def wrapperprocess(inpstr):
    li = []
    if len(inpstr) == 0:
        return
    processstring('',inpstr,li)
    wrapperprocess(inpstr[1:len(inpstr)])


wrapperprocess('helloyellowellow')
print master
like image 710
user2917012 Avatar asked Oct 24 '13 18:10

user2917012


Video Answer


2 Answers

Since you mentioned you're looking for an efficient algorithm, and assuming you get the dictionary in advance (and not just as a callable predicate), you can use the Aho–Corasick algorithm.

Of course, if the input text is short, a more naive algorithm will be faster, for avoiding the "expensive" pre-processing of the dictionary.

Plus, an alternative python-answer: here's a simple way to simply check each substring:

def gen_words(txt):
    n = len(txt)
    for i in range(n):
        for j in range(i+1, n+1):
            subtxt = txt[i:j]
            if is_word(subtxt):
                yield subtxt

For uniqueness, do:

all_words = set(gen_words(txt))
like image 58
shx2 Avatar answered Oct 12 '22 11:10

shx2


You could do something like:

tgt='helloyellowellow'

with open('/usr/share/dict/words') as f:
    for word in f:
        word=word.strip()
        if word in tgt and len(word)>1:
            print word

Prints:

el
ell
he
hell
hello
lo
low
loy
ow
owe
we
well
ye
yell
yellow

If you are just looking for the function is_word that you have undefined, you can play with something like this:

def is_word(word, dic='/usr/share/dict/words'):
    if not hasattr(is_word, 'words'):
        with open(dic) as f:
            is_word.words={word.strip() for word in f}

    return word in is_word.words and len(word)>1

As a default data structure, Python sets have an average look-up time of O(1). You are very unlikely to write something on your own that is faster.

like image 21
dawg Avatar answered Oct 12 '22 12:10

dawg