On the back of a block calendar I found the following riddle:
How many common English words of 4 letters or more can you make from the letters of the word 'textbook' (each letter can only be used once).
My first solution that I came up with was:
from itertools import permutations
with open('/usr/share/dict/words') as f:
words = f.readlines()
words = map(lambda x: x.strip(), words)
given_word = 'textbook'
found_words = []
ps = (permutations(given_word, i) for i in range(4, len(given_word)+1))
for p in ps:
for word in map(''.join, p):
if word in words and word != given_word:
found_words.append(word)
print set(found_words)
This gives the result set(['tote', 'oboe', 'text', 'boot', 'took', 'toot', 'book', 'toke', 'betook'])
but took more than 7 minutes on my machine.
My next iteration was:
with open('/usr/share/dict/words') as f:
words = f.readlines()
words = map(lambda x: x.strip(), words)
given_word = 'textbook'
print [word for word in words if len(word) >= 4 and sorted(filter(lambda letter: letter in word, given_word)) == sorted(word) and word != given_word]
Which return an answer almost immediately but gave as answer: ['book', 'oboe', 'text', 'toot']
What is the fastest, correct and most pythonic solution to this problem?
(edit: added my earlier permutations solution and its different output).
I thought I'd share this slightly interesting trick although it takes a good bit more code than the rest and isn't really "pythonic". This will take a good bit more code than the other solutions but should be rather quick if I look at the timing the others need.
We're doing a bit preprocessing to speed up the computations. The basic approach is the following: We assign every letter in the alphabet a prime number. E.g. A = 2, B = 3, and so on. We then compute a hash for every word in the alphabet which is simply the product of the prime representations of every character in the word. We then store every word in a dictionary indexed by the hash.
Now if we want to find out which words are equivalent to say textbook
we only have to compute the same hash for the word and look it up in our dictionary. Usually (say in C++) we'd have to worry about overflows, but in python it's even simpler than that: Every word in the list with the same index will contain exactly the same characters.
Here's the code with the slightly optimization that in our case we only have to worry about characters also appearing in the given word, which means we can get by with a much smaller prime table than otherwise (the obvious optimization would be to only assign characters that appear in the word a value at all - it was fast enough anyhow so I didn't bother and this way we could pre process only once and do it for several words). The prime algorithm is useful often enough so you should have one yourself anyhow ;)
from collections import defaultdict
from itertools import permutations
PRIMES = list(gen_primes(256)) # some arbitrary prime generator
def get_dict(path):
res = defaultdict(list)
with open(path, "r") as file:
for line in file.readlines():
word = line.strip().upper()
hash = compute_hash(word)
res[hash].append(word)
return res
def compute_hash(word):
hash = 1
for char in word:
try:
hash *= PRIMES[ord(char) - ord(' ')]
except IndexError:
# contains some character out of range - always 0 for our purposes
return 0
return hash
def get_result(path, given_word):
words = get_dict(path)
given_word = given_word.upper()
result = set()
powerset = lambda x: powerset(x[1:]) + [x[:1] + y for y in powerset(x[1:])] if x else [x]
for word in (word for word in powerset(given_word) if len(word) >= 4):
hash = compute_hash(word)
for equiv in words[hash]:
result.add(equiv)
return result
if __name__ == '__main__':
path = "dict.txt"
given_word = "textbook"
result = get_result(path, given_word)
print(result)
Runs on my ubuntu word list (98k words) rather quickly, but not what I'd call pythonic since it's basically a port of a c++ algorithm. Useful if you want to compare more than one word that way..
How about this?
from itertools import permutations, chain
with open('/usr/share/dict/words') as fp:
words = set(fp.read().split())
given_word = 'textbook'
perms = (permutations(given_word, i) for i in range(4, len(given_word)+1))
pwords = (''.join(p) for p in chain(*perms))
matches = words.intersection(pwords)
print matches
which gives
>>> print matches
set(['textbook', 'keto', 'obex', 'tote', 'oboe', 'text', 'boot', 'toto', 'took', 'koto', 'bott', 'tobe', 'boke', 'toot', 'book', 'bote', 'otto', 'toke', 'toko', 'oket'])
There is a generator itertools.permutations
with which you can gather all permutations of a sequence with a specified length. That makes it easier:
from itertools import permutations
GIVEN_WORD = 'textbook'
with open('/usr/share/dict/words', 'r') as f:
words = [s.strip() for s in f.readlines()]
print len(filter(lambda x: ''.join(x) in words, permutations(GIVEN_WORD, 4)))
Edit #1: Oh! It says "4 or more" ;) Forget what I said!
Edit #2: This is the second version I came up with:
LETTERS = set('textbook')
with open('/usr/share/dict/words') as f:
WORDS = filter(lambda x: len(x) >= 4, [l.strip() for l in f])
matching = filter(lambda x: set(x).issubset(LETTERS) and all([x.count(c) == 1 for c in x]), WORDS)
print len(matching)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With