Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Solve Hangman in AI way

Tags:

algorithm

I named it as "AI way" because I'm thinking make Application to play the hangman game without human being interactive.

The scenario is like this:

  1. a available word list which would contains hundreds of thousands English word.
  2. The Application will pick certain amount of words, e.g 20 from the list.
  3. The Application play Hangman against each word until either WON or FAILURE. The restriction here is max wrong bad guess. 26 does not make sense obviously and let's say 6 for the max wrong guess.

I tried the strategy mentioned at wiki page but it does not work well. Basically successful rate is about 30%.

Any suggestions / comments regarding strategy as well as which field I should dig in order to find a fair good strategy?

Thanks a lot.

-Simon

PS: A JavaScript implementation which looks fairly well. (https://github.com/freizl/play-hangman-game)

like image 434
Simon Avatar asked Feb 09 '12 05:02

Simon


Video Answer


2 Answers

Updated Idea

  1. Download a dictionary of words and put it into some database or structure of your choice
  2. When presented with a word, narrow your guesses to words of the same length and perform a letter frequency distribution (you can use a dictionary and/or list collection for fast distribution analysis and sorting)
  3. Pick the most common letter from this list
  4. If the letter is found, create a regex pattern based on the known letter(s) and the word length and repeat from step 2
  5. You should be able to quickly narrow down a single word resulting from your pattern search

For posterity:

Take a look at this wiki page. It includes a table of frequencies of the first letters of words which may help you tune your algorithm.

You could also take into account the fact that if you find a vowel or two in a word the likelihood of finding other vowels will decrease significantly and you should then try more common consonants instead. The example from the wiki page you listed start with E then T and then tries three vowels in a row: A, O and I. The first two letters are missed but once the third letter is found, twice then the process should switch to common consonants and skip trying for more vowels since there will likely be fewer.

Any useful strategies will certainly employ frequency distribution charts on letters and possibly words e.g. some words are very common while others are rarely used so performing a letter frequency distribution on a set of more common words might help... guessing that some words may appear more frequently than other but that depends on your word selection algorithm which might not take into account "common" usage.

You could also build specialized letter frequency tables and possibly even on-the-fly. For example, given the wikipedia h a ngm a n example: You find the letter A twice in a word in two locations 2nd and 6th. You know that the word has seven letters and with a fairly simple reg ex you could isolate the words from a dictionary that match this pattern:

_ a _ _ _ a _

Then perform a letter frequency on that set of words that matches this pattern and use that set for your next guess. Rinse and repeat. I think doing some of those things I mentioned but especially the last will really increase your odds of success.

like image 150
Paul Sasik Avatar answered Nov 08 '22 10:11

Paul Sasik


The strategies in the linked page seem to be "order guesses by letter frequency" and "guess the vowels, then order guesses by letter frequency"

A couple of observations about hangman:

1) Since guessing a letter that isn't in the word hurts us, we should guess letters by word frequency (percentage of words that contain letter X), not letter frequency (number of times that X appears in all words). This should maximise our chances of guessing a bad letter.

2) Once we've guessed some letters correctly, we know more about the word we're trying to guess.

Here are two strategies that should beat the letter frequency strategy. I'm going to assume we have a dictionary of words that might come up.

If we expect the word to be in our dictionary:

1) We know the length of the target word, n. Remove all words in the dictionary that aren't of length n

2) Calculate the word frequency of all letters in the dictionary

3) Guess the most frequent letter that we haven't already guessed.

4) If we guessed correctly, remove all words from the dictionary that don't match the revealed letters.

5) If we guessed incorrectly, remove all words that contain the incorrectly guessed letter

6) Go to step 2

For maximum effect, instead of calculating word frequencies of all letters in step 2, calculate the word frequencies of all letters in positions that are still blank in the target word.

If we don't expect the word to be in our dictionary:

1) From the dictionary, build up a table of n-grams for some value of n (say 2). If you haven't come across n-grams before, they are groups of consecutive letters inside the word. For example, if the word is "word", the 2-grams are {^w,wo,or,rd,d$}, where ^ and $ mark the start and the end of the word. Count the word frequency of these 2-grams.

2) Start by guessing single letters by word frequency as above

3) Once we've had some hits, we can use the table of word frequency of n-grams to determine either letters to eliminate from our guesses, or letters that we're likely to be able to guess. There are a lot of ways you could achieve this:

For example, you could use 2-grams to determine that the blank in w_rd is probably not z. Or, you could determine that the character at the end of the word ___e_ might (say) be d or s.

Alternatively you could use the n-grams to generate the list of possible characters (though this might be expensive for long words). Remember that you can always cross off all n-grams that contain letters you've guessed that aren't in the target word.

Remember that at each step you're trying not to make a wrong guess, since that keeps us alive. If the n-grams tell you that one position is likely to only be (say) a,b or c, and your word frequency table tells you that a appears in 30% of words, but b and c only appear in 10%, then guess a.

For maximum benefit, you could combine the two strategies.

like image 35
Timothy Jones Avatar answered Nov 08 '22 11:11

Timothy Jones