What is a good algorithm to determine the "difficulty" of a word for a hangman game, so that the game can select words to match a specified difficulty level?
Difficulty would seem to be related to the number of guesses required, the relative frequency with which letters are used (e.g. words with many uncommon letters may be harder to guess), and potentially the length of the word.
There are also some subjective factors to (attempt to) compensate for, such as the likelihood that a word is in the player's vocabulary and can be recognised, allowing a move from a guessing strategy based on letter frequencies alone to one based on a list of known matching words.
My attempt so far is below, in Ruby. Any suggestions on how to improve the categorisation?
def classify_word(w)
  n = w.chars.to_a.uniq.length  # Num. unique chars in w
  if n < 5 and w.length > 4
    return WordDifficulty::Easy
  end
  if n > w.length / 2
    return WordDifficulty::Hard
  else
    return WordDifficulty::Medium
  end
end
I am writing a hangman game that I would like my children to play; I am rather too old to be attempting "homework", which may be why the question is receiving so many downvotes... Words are drawn randomly from large word databases, which include many obscure words, and are being filtered by the difficulty level determined for each word.
There are several factors that relate to word difficulty, including age at acquisition, imageability, concreteness, abstractness, syllable count, and frequency (spoken and written). There are also psycholinguistic databases that will search for words by at least some of these factors.
Here's a way to approach this problem systematically: if you have an algorithm that plays hangman well, then you can take the difficulty of each word to be the number of wrong guesses that your program would take if guessing that word.
There's an idea that's implicit in some of the other answers and comments: that the optimal strategy for the solver would be to base its decisions on the frequency of letters in English, or on the frequency of words in some corpus. This is a seductive idea, but it's not quite right. The solver does best if it accurately models the distribution of words chosen by the setter, and a human setter may well be choosing words based on their rarity or their avoidance of frequently used letters. For example, although E is the most frequently used letter in English, if the setter always chooses from the words JUGFUL, RHYTHM, SYZYGY, and ZYTHUM, then a perfect solver does not start by guessing E!
The best approach to modelling the setter depends on the context, but I guess that some kind of Bayesian inductive inference would work well in a context where the solver plays many games against the same setter, or against a group of similar setters.
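To make that idea a little more concrete, here's a minimal sketch (my own assumption about how such inference might look, not a worked-out method) in which the solver keeps a count of the words the setter has chosen in past games, smoothed by a symmetric Dirichlet prior over the dictionary, and uses the posterior predictive probability of each word instead of a uniform distribution:

from collections import Counter

class SetterModel:
    """Toy Bayesian model of a setter: a symmetric Dirichlet prior over a
    fixed dictionary, updated with the words observed in previous games.
    The prior strength 'alpha' is an arbitrary choice here."""
    def __init__(self, dictionary, alpha=1.0):
        self.dictionary = list(dictionary)
        self.alpha = alpha
        self.counts = Counter()

    def observe(self, word):
        # Record the word the setter actually chose in a finished game.
        self.counts[word] += 1

    def probability(self, word):
        # Posterior predictive probability that the setter picks 'word'.
        total = sum(self.counts.values()) + self.alpha * len(self.dictionary)
        return (self.counts[word] + self.alpha) / total

A solver using such a model would weight each candidate word by probability() rather than counting words uniformly when evaluating the cost of a guess.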
Here I'll outline a solver that is pretty good (but far from perfect). It models the setter as choosing words uniformly from a fixed dictionary. It's a greedy algorithm: at each stage it guesses the letter that minimizes the number of misses, that is, words that do not contain the guess. For example, if no guesses have been made so far, and the possible words are DEED, DEAD and DARE, then:

- if you guess D or E, there are no misses;
- if you guess A, there's one miss (DEED);
- if you guess R, there are two misses (DEED and DEAD).

So either D or E is a good guess in this situation.
(Thanks to Colonel Panic in comments for pointing out that correct guesses are free in hangman—I totally forgot this in my first attempt!)
Here's an implementation of this algorithm in Python:
from collections import defaultdict
from string import ascii_lowercase

def partition(guess, words):
    """Apply the single letter 'guess' to the sequence 'words' and return
    a dictionary mapping the pattern of occurrences of 'guess' in a word
    to the list of words with that pattern.

    >>> words = 'deed even eyes mews peep star'.split()
    >>> sorted(list(partition('e', words).items()))
    [(0, ['star']), (2, ['mews']), (5, ['even', 'eyes']), (6, ['deed', 'peep'])]

    """
    result = defaultdict(list)
    for word in words:
        key = sum(1 << i for i, letter in enumerate(word) if letter == guess)
        result[key].append(word)
    return result

def guess_cost(guess, words):
    """Return the cost of a guess, namely the number of words that don't
    contain the guess.

    >>> words = 'deed even eyes mews peep star'.split()
    >>> guess_cost('e', words)
    1
    >>> guess_cost('s', words)
    3

    """
    return sum(guess not in word for word in words)

def word_guesses(words, wrong=0, letters=''):
    """Given the collection 'words' that match all letters guessed so far,
    generate tuples (wrong, nguesses, word, guesses) where 'word' is the
    word that was guessed; 'guesses' is the sequence of letters guessed;
    'wrong' is the number of these guesses that were wrong; 'nguesses' is
    len(guesses).

    >>> words = 'deed even eyes heel mere peep star'.split()
    >>> from pprint import pprint
    >>> pprint(sorted(word_guesses(words)))
    [(0, 1, 'mere', 'e'),
     (0, 2, 'deed', 'ed'),
     (0, 2, 'even', 'en'),
     (1, 1, 'star', 'e'),
     (1, 2, 'eyes', 'en'),
     (1, 3, 'heel', 'edh'),
     (2, 3, 'peep', 'edh')]

    """
    if len(words) == 1:
        yield wrong, len(letters), words[0], letters
        return
    best_guess = min((g for g in ascii_lowercase if g not in letters),
                     key=lambda g: guess_cost(g, words))
    best_partition = partition(best_guess, words)
    letters += best_guess
    for pattern, words in best_partition.items():
        for guess in word_guesses(words, wrong + (pattern == 0), letters):
            yield guess
Using this strategy it's possible to evaluate the difficulty of guessing each word in a collection. Here I consider the six-letter words in my system dictionary:
>>> words = [w.strip() for w in open('/usr/share/dict/words') if w.lower() == w]
>>> six_letter_words = set(w for w in words if len(w) == 6)
>>> len(six_letter_words)
15066
>>> results = sorted(word_guesses(six_letter_words))
The easiest words to guess in this dictionary (together with the sequence of guesses needed for the solver to guess them) are as follows:
>>> from pprint import pprint
>>> pprint(results[:10])
[(0, 1, 'eelery', 'e'),
 (0, 2, 'coneen', 'en'),
 (0, 2, 'earlet', 'er'),
 (0, 2, 'earner', 'er'),
 (0, 2, 'edgrew', 'er'),
 (0, 2, 'eerily', 'el'),
 (0, 2, 'egence', 'eg'),
 (0, 2, 'eleven', 'el'),
 (0, 2, 'enaena', 'en'),
 (0, 2, 'ennead', 'en')]
and the hardest words are these:
>>> pprint(results[-10:])
[(12, 16, 'buzzer', 'eraoiutlnsmdbcfg'),
 (12, 16, 'cuffer', 'eraoiutlnsmdbpgc'),
 (12, 16, 'jugger', 'eraoiutlnsmdbpgh'),
 (12, 16, 'pugger', 'eraoiutlnsmdbpcf'),
 (12, 16, 'suddle', 'eaioulbrdcfghmnp'),
 (12, 16, 'yucker', 'eraoiutlnsmdbpgc'),
 (12, 16, 'zipper', 'eraoinltsdgcbpjk'),
 (12, 17, 'tuzzle', 'eaioulbrdcgszmnpt'),
 (13, 16, 'wuzzer', 'eraoiutlnsmdbpgc'),
 (13, 17, 'wuzzle', 'eaioulbrdcgszmnpt')]
The reason that these are hard is that after you've guessed -UZZLE, you still have seven possibilities left:
>>> ' '.join(sorted(w for w in six_letter_words if w.endswith('uzzle')))
'buzzle guzzle muzzle nuzzle puzzle tuzzle wuzzle'
Of course when preparing wordlists for your children you wouldn't start with your computer's system dictionary, you'd start with a list of words that you think they are likely to know. For example, you might have a look at Wiktionary's lists of the most frequently used words in various English corpora.
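For example, here's one way such filtering might look (a sketch only: frequency_list.txt is a hypothetical file you'd prepare yourself from such a corpus list, one word per line, most frequent first):

def familiar_words(dictionary_path='/usr/share/dict/words',
                   frequency_path='frequency_list.txt', top=10000):
    """Return the dictionary words that appear among the 'top' entries
    of the (hypothetical) frequency list file."""
    with open(frequency_path) as f:
        # Read only the first 'top' lines of the frequency list.
        common = set(line.strip().lower() for _, line in zip(range(top), f))
    with open(dictionary_path) as f:
        return [line.strip() for line in f if line.strip().lower() in common]

You could then run word_guesses over familiar_words() instead of the whole system dictionary.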
For example, among the 1,700 six-letter words in the 10,000 most common words in Project Gutenberg as of 2006, the most difficult ten are these:
[(6, 10, 'losing', 'eaoignvwch'),
 (6, 10, 'monkey', 'erdstaoync'),
 (6, 10, 'pulled', 'erdaioupfh'),
 (6, 10, 'slaves', 'erdsacthkl'),
 (6, 10, 'supper', 'eriaoubsfm'),
 (6, 11, 'hunter', 'eriaoubshng'),
 (6, 11, 'nought', 'eaoiustghbf'),
 (6, 11, 'wounds', 'eaoiusdnhpr'),
 (6, 11, 'wright', 'eaoithglrbf'),
 (7, 10, 'soames', 'erdsacthkl')]
(Soames Forsyte is a character in The Forsyte Saga by John Galsworthy; the wordlist has been converted to lower-case, so it wasn't possible for me to quickly remove proper names.)
A really simple way would be to compute a score based on the lack of vowels in the word, the number of unique letters, and the commonness of each letter:
letters = 'etaoinshrdlcumwfgypbvkjxqz'
vowels = set('aeiou')

def difficulty(word):
    unique = set(word)
    positions = sum(letters.index(c) for c in word)
    return len(word) * len(unique) * (7 - len(unique & vowels)) * positions

words = ['the', 'potato', 'school', 'egypt', 'floccinaucinihilipilification']
for word in words:
    print(difficulty(word), word)
And the output:
432 the
3360 potato
7200 school
7800 egypt
194271 floccinaucinihilipilification
You could then score the words with:
score < 2000            # Easy
2000 <= score < 10000   # Medium
10000 <= score          # Hard
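Wrapped up as a function (just a sketch; the string labels are placeholders for whatever difficulty representation your game uses):

def classify(word):
    # Map the raw difficulty score onto the three bands above.
    score = difficulty(word)
    if score < 2000:
        return 'Easy'
    elif score < 10000:
        return 'Medium'
    else:
        return 'Hard'

For example, classify('potato') returns 'Medium', consistent with its score of 3360 above.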