Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient way to find anagrams of sentences from dictionary?

I need to make a program that takes a file with a dictionary and an arbitrary string as an input and then outputs all combinations of words from that dictionary that make up anagrams of the given string. For example, using the 100 most popular words of the English language and the string "i not work", I should get something like [' on it work', ' into work', ' not i work', ' know or it', ' work it no', ' to work in'], which I do.

The problem is that my program is far too inefficient: with 100 words in the dictionary the practical limit is 7 characters for the string length, everything after that takes far too long. I tried looking for various algorithms related to the matter to no avail.

Here's how I search for anagrams:

def sortstring(string):
    return ''.join(sorted(string))

def simplify(all_strings):
    possible_strings = defaultdict(list)
    for string in all_strings:
        possible_strings[sortstring(string).strip()].append(string)
    return possible_strings

def generate(database, length,curstring="", curdata=set()):
    if len(curstring.replace(" ", "")) > length:
        return set()
    if len((curstring).replace(" ", "")) == length:
        return curdata.union(set([curstring]))
    for i in database:
        if len((curstring+i).replace(" ", "")) <= length:
            curdata = curdata.union(generate(database.difference(set([i])),
                length, curstring+" "+i, curdata))
            database = database.difference(set([i]))
    return curdata

def analyse(database, input_string):
    cletters = countstring(input_string)
    strings = simplify(generate(database, cletters))
    data = list()
    sorted_string = sortstring(input_string).strip()
    if sorted_string in strings.keys():
        data = strings[sorted_string]
    return len(strings.values()), data

def countstring(string):
    a = countletters(string)
    return sum(a.values())

def countletters(string):
    result = {}
    for i in ascii_lowercase:
        result[i] = string.count(i)
    return result

Can anyone suggest a way to improve on this? Though I suppose that the algorithm I used should be completely ditched given that the complexity seems far too high because of how slow it is. Just in case: the program should be efficient enough to support dictionaries of tens of thousands of words and strings up to tens of characters. That's far better than what I did.

like image 971
Mashallah Avatar asked Oct 19 '22 20:10

Mashallah


1 Answers

I resolved a part of the issue myself. Resolved the for-if antipattern in the generator code:

def generate(database, length,letters,curstring="",curdata=set()):
if len(curstring.replace(" ",""))>length:
    return set()
if len((curstring).replace(" ",""))==length:
    return curdata.union(set([curstring]))
t=countletters(curstring)
for i in ascii_lowercase:
    if t[i]>letters[i]:
        return set()
for i in database:
    t=countletters(curstring+i)
    test=0
    for j in ascii_lowercase:
        if t[j]>letters[j]:
            test=1
    if test: continue
    if sum(t.values())<=length:
        curdata=curdata.union(generate(database.difference(set([i])),length,letters,curstring+" "+i,curdata))
        database=database.difference(set([i]))
return curdata

It is much, much faster now, but is still slow if the dictionary contains tens of thousands of words and/or if the input string is long.

like image 134
Mashallah Avatar answered Oct 23 '22 10:10

Mashallah