I need to make a program that takes as input a file containing a dictionary and an arbitrary string, and outputs all combinations of words from that dictionary that form anagrams of the given string.
For example, using the 100 most popular words of the English language and the string "i not work", I should get something like [' on it work', ' into work', ' not i work', ' know or it', ' work it no', ' to work in'], which I do.
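For the record, two strings count as anagrams here exactly when their letters, ignoring spaces, sort to the same sequence; a quick self-contained check (the helper name is my own):

```python
def is_anagram(a, b):
    # Compare the sorted letters of both strings, spaces removed.
    key = lambda s: sorted(s.replace(" ", ""))
    return key(a) == key(b)

print(is_anagram("i not work", "on it work"))  # True
```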
The problem is that my program is far too inefficient: with 100 words in the dictionary, the practical limit is a string length of about 7 characters; anything longer takes far too long. I have tried looking for algorithms related to this problem, to no avail.
Here's how I search for anagrams:
from collections import defaultdict
from string import ascii_lowercase

def sortstring(string):
    return ''.join(sorted(string))

def simplify(all_strings):
    possible_strings = defaultdict(list)
    for string in all_strings:
        possible_strings[sortstring(string).strip()].append(string)
    return possible_strings

def generate(database, length, curstring="", curdata=set()):
    if len(curstring.replace(" ", "")) > length:
        return set()
    if len(curstring.replace(" ", "")) == length:
        return curdata.union(set([curstring]))
    for i in database:
        if len((curstring + i).replace(" ", "")) <= length:
            curdata = curdata.union(generate(database.difference(set([i])),
                                             length, curstring + " " + i, curdata))
            database = database.difference(set([i]))
    return curdata

def analyse(database, input_string):
    cletters = countstring(input_string)
    strings = simplify(generate(database, cletters))
    data = list()
    sorted_string = sortstring(input_string).strip()
    if sorted_string in strings:
        data = strings[sorted_string]
    return len(strings.values()), data

def countstring(string):
    a = countletters(string)
    return sum(a.values())

def countletters(string):
    result = {}
    for i in ascii_lowercase:
        result[i] = string.count(i)
    return result
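The sorted-letter signature that simplify keys on can also group anagrams within a plain word list; here is a minimal self-contained sketch of the idea (the toy word list is made up):

```python
from collections import defaultdict

def sortstring(string):
    return ''.join(sorted(string))

# Toy word list: anagrams share the same sorted-letter signature.
words = ["on", "no", "not", "ton", "work"]
groups = defaultdict(list)
for w in words:
    groups[sortstring(w)].append(w)

print(groups["no"])   # ['on', 'no']
print(groups["not"])  # ['not', 'ton']
```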
Can anyone suggest a way to improve on this? Though I suppose the algorithm I used should be ditched entirely, given that its complexity seems far too high judging by how slow it is. Just in case: the program should be efficient enough to support dictionaries of tens of thousands of words and strings of up to tens of characters, which is far beyond what my version can handle.
I have resolved part of the issue myself by fixing the for-if antipattern in the generator code:
def generate(database, length, letters, curstring="", curdata=set()):
    if len(curstring.replace(" ", "")) > length:
        return set()
    if len(curstring.replace(" ", "")) == length:
        return curdata.union(set([curstring]))
    t = countletters(curstring)
    for i in ascii_lowercase:
        if t[i] > letters[i]:
            return set()
    for i in database:
        t = countletters(curstring + i)
        test = 0
        for j in ascii_lowercase:
            if t[j] > letters[j]:
                test = 1
        if test:
            continue
        if sum(t.values()) <= length:
            curdata = curdata.union(generate(database.difference(set([i])),
                                             length, letters, curstring + " " + i, curdata))
            database = database.difference(set([i]))
    return curdata
It is much, much faster now, but is still slow if the dictionary contains tens of thousands of words and/or if the input string is long.
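The per-letter pruning loop could also be expressed with collections.Counter from the standard library, which stores letter counts directly; this is only a sketch of the check, not a drop-in replacement for generate (the function name fits is my own):

```python
from collections import Counter

def fits(word, remaining):
    # A word is usable only if each of its letters is still available
    # in the remaining letter counts of the input string.
    need = Counter(word)
    return all(remaining[c] >= n for c, n in need.items())

remaining = Counter("inotwork")
print(fits("work", remaining))   # True
print(fits("knows", remaining))  # False: no 's' left
```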