
Algorithm to generate a crossword [closed]


I came up with a solution which probably isn't the most efficient, but it works well enough. Basically:

  1. Sort all the words by length, descending.
  2. Take the first word and place it on the board.
  3. Take the next word.
  4. Search through all the words that are already on the board and see if there are any possible intersections (any common letters) with this word.
  5. If there is a possible location for this word, loop through all the words that are on the board and check to see if the new word interferes.
  6. If this word doesn't break the board, place it there and go to step 3; otherwise, continue searching for a spot (step 4).
  7. Continue this loop until all the words are either placed or unable to be placed.
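The steps above can be sketched in Python roughly like this. The grid representation (a sparse dict of `(row, col) -> letter`) and the helper names are my own, and the conflict check skips the adjacency rules a polished generator would also need:

```python
def cells(word, pos):
    """Yield ((row, col), letter) for each letter of a word placed at pos."""
    row, col, horiz = pos
    for k, ch in enumerate(word):
        yield ((row, col + k) if horiz else (row + k, col)), ch

def fits(word, pos, placed):
    """Step 5: the new word may only overlap cells holding the same letter."""
    grid = {}
    for w, p in placed.items():
        for cell, ch in cells(w, p):
            grid[cell] = ch
    return all(grid.get(cell, ch) == ch for cell, ch in cells(word, pos))

def candidate_positions(word, placed):
    """Step 4: every crossing of a shared letter with an already-placed word."""
    for other, (row, col, horiz) in placed.items():
        for i, a in enumerate(word):
            for j, b in enumerate(other):
                if a == b:
                    # Cross perpendicular to `other` at the shared letter.
                    yield ((row - i, col + j, False) if horiz
                           else (row + j, col - i, True))

def place_all(words):
    words = sorted(words, key=len, reverse=True)       # step 1
    placed = {words[0]: (0, 0, True)}                  # step 2: first word across
    for word in words[1:]:                             # step 3
        for pos in candidate_positions(word, placed):  # step 4
            if fits(word, pos, placed):                # step 5
                placed[word] = pos                     # step 6
                break                                  # next word (step 3)
    return placed                                      # step 7: some words may be unplaced
```

Words that never find a legal crossing simply stay off the board, matching step 7.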

This makes a working, yet often quite poor, crossword. I made a number of alterations to the basic recipe above to get a better result.

  • At the end of generating a crossword, give it a score based on how many of the words were placed (the more the better), how large the board is (the smaller the better), and the ratio between height and width (the closer to 1 the better). Generate a number of crosswords and then compare their scores and choose the best one.
    • Instead of running an arbitrary number of iterations, I decided to create as many crosswords as possible in an arbitrary amount of time. If you only have a small word list, you'll get dozens of possible crosswords in 5 seconds. A larger crossword might only be chosen from 5-6 possibilities.
  • When placing a new word, instead of placing it immediately upon finding an acceptable location, give that word location a score based on how much it increases the size of the grid and how many intersections there are (ideally you'd want each word to be crossed by 2-3 other words). Keep track of all the positions and their scores and then choose the best one.
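The scoring and best-of-many loop from the bullets above might look like this. The weights and the `(score, board)` convention are my own inventions, not from the answer:

```python
import time

def board_score(words_placed, total_words, width, height):
    """Score a finished board: reward placed words, penalize area,
    reward a near-square aspect ratio. Weights are invented; tune to taste."""
    placement = words_placed / total_words              # more placed is better
    size = 1.0 / (width * height)                       # smaller is better
    aspect = min(width, height) / max(width, height)    # closer to 1 is better
    return placement * 10 + size * 100 + aspect

def best_board(generate, seconds=5.0):
    """Call `generate` (returning a (score, board) pair) until the time
    budget runs out, keeping the highest-scoring board seen."""
    deadline = time.monotonic() + seconds
    best_score, best = float("-inf"), None
    while time.monotonic() < deadline:
        score, board = generate()
        if score > best_score:
            best_score, best = score, board
    return best
```

With a small word list, `generate` gets called many times within the budget; with a large one, only a handful of boards compete.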

I just recently wrote my own in Python. You can find it here: http://bryanhelmig.com/python-crossword-puzzle-generator/. It doesn't create the dense NYT style crosswords, but the style of crosswords you might find in a child's puzzle book.

Unlike some of the algorithms I found out there that place words by purely random brute force, I tried to implement a slightly smarter brute-force approach to word placement. Here's my process:

  1. Create a grid of whatever size and a list of words.
  2. Shuffle the word list, then sort the words from longest to shortest (the shuffle randomizes the order of same-length words, since the sort is stable).
  3. Place the first (longest) word in the upper-left-most position, (1,1), either vertically or horizontally.
  4. Move on to the next word, then loop over each letter in the word and each cell in the grid, looking for letter-to-letter matches.
  5. When a match is found, simply add that position to a suggested coordinate list for that word.
  6. Loop over the suggested coordinate list and "score" the word placement based on how many other words it crosses. Scores of 0 indicate either bad placement (adjacent to existing words) or that there were no word crosses.
  7. Go back to step 4 until the word list is exhausted, then optionally make a second pass.
  8. We should now have a crossword, but the quality can be hit or miss due to some of the random placements. So, we buffer this crossword and go back to step #2. If the next crossword has more words placed on the board, it replaces the crossword in the buffer. This is time limited (find the best crossword in x seconds).

By the end, you have a decent crossword puzzle or word search puzzle, since they are about the same. It tends to run rather well, but let me know if you have any suggestions for improvement. Bigger grids run exponentially slower; bigger word lists scale linearly. Bigger word lists also give a much better chance of good word-placement numbers.
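Steps 4-6 can be sketched like this, with the grid as a list of lists holding `None` for empty cells. For brevity only vertical candidate placements are generated, and the adjacency check mentioned in step 6 is left out:

```python
def candidates(word, grid):
    """Step 4/5: scan the grid for letter matches and collect the top cell
    of each suggested vertical placement."""
    hits = []
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            for i, letter in enumerate(word):
                if cell == letter:
                    hits.append((r - i, c))   # word would start i cells above
    return hits

def placement_score(word, grid, top, col):
    """Step 6: score a vertical placement by how many existing letters it
    crosses. 0 means a clash, falling off the board, or no crossings."""
    score = 0
    for i, letter in enumerate(word):
        r = top + i
        if not (0 <= r < len(grid) and 0 <= col < len(grid[0])):
            return 0                          # falls off the board
        cell = grid[r][col]
        if cell == letter:
            score += 1                        # crosses an existing word
        elif cell is not None:
            return 0                          # clashes with another word
    return score
```

A real implementation would generate horizontal candidates the same way and also zero out placements that sit flush against an existing word.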


I actually wrote a crossword generation program about ten years ago (it was for cryptic crosswords, but the same rules apply to normal crosswords).

It had a list of words (and associated clues) stored in a file sorted by descending usage to date (so that lesser-used words were at the top of the file). A template, basically a bit-mask representing the black and free squares, was chosen randomly from a pool that was provided by the client.

Then, for each incomplete word in the puzzle (basically, find the first blank square and check whether the square to its right (for an across word) or the square beneath it (for a down word) is also blank), a search was done through the file for the first word that fitted, taking into account the letters already in that word. If no word could fit, the whole word was marked as incomplete and the process moved on.

At the end there would be some uncompleted words, which the compiler would have to fill in (adding the word and a clue to the file if desired). If they couldn't come up with any ideas, they could edit the crossword manually to change the constraints, or just ask for a total re-generation.

Once the word/clue file got up to a certain size (and it was growing by 50-100 clues a day for this client), there was rarely a case of more than two or three manual fix-ups having to be done for each crossword.
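The slot-filling search might have looked something like this. A plain Python list stands in for the usage-sorted file, and the pattern convention (`None` for blanks) is my own:

```python
import re

def first_fit(pattern, words, used):
    """Find the first unused word that fits a slot. `pattern` holds the
    slot's fixed letters, with None for blanks. Because the list is sorted
    with lesser-used words first, fresher words are preferred automatically."""
    regex = re.compile("".join(ch if ch else "." for ch in pattern) + "$")
    for word in words:
        if len(word) == len(pattern) and word not in used and regex.match(word):
            return word
    return None   # no fit: mark the slot incomplete for the compiler
```

Returning `None` corresponds to marking the word incomplete and moving on, leaving it for the human compiler to fill.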


This algorithm creates 50 dense 6x9 arrow crosswords in 60 seconds. It uses a word database (words plus their tips) and a board database (pre-configured boards).

1) Search for all starting cells (the ones with an arrow); store their sizes and directions
2) Loop through all starting cells
2.1) Search for a word
2.1.1) Check that it has not already been used
2.1.2) Check that it fits
2.2) Add the word to the board
3) Check whether all cells were filled

A bigger word database decreases generation time considerably, and some kinds of boards are harder to fill. Bigger boards require more time to fill correctly.
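A minimal sketch of that loop, with invented data shapes: each starting cell is a `(row, col, direction, length)` slot, the word database is keyed by length, and the board is a list of lists with `None` for empty cells:

```python
def slot_cells(row, col, direction, length):
    """The cells a slot occupies, reading across or down from its start."""
    dr, dc = (0, 1) if direction == "across" else (1, 0)
    return [(row + dr * k, col + dc * k) for k in range(length)]

def fits_slot(word, row, col, direction, board):       # step 2.1.2
    return all(board[r][c] in (None, ch)
               for (r, c), ch in zip(slot_cells(row, col, direction, len(word)), word))

def fill_board(slots, words_by_len, board):
    used = set()
    for row, col, direction, length in slots:          # step 2
        for word in words_by_len.get(length, []):      # step 2.1
            if word in used:                           # step 2.1.1
                continue
            if fits_slot(word, row, col, direction, board):
                for (r, c), ch in zip(slot_cells(row, col, direction, length), word):
                    board[r][c] = ch                   # step 2.2
                used.add(word)
                break
    # Step 3: success only if every cell got filled.
    return all(cell is not None for line in board for cell in line)
```

If step 3 fails, the generator simply restarts on a fresh copy of the board, which is why a bigger word database speeds things up so much.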


Example:

Pre-Configured 6x9 Board:

(# means one tip in one cell, % means two tips in one cell, arrows not shown)

# - # # - % # - # 
- - - - - - - - - 
# - - - - - # - - 
% - - # - # - - - 
% - - - - - % - - 
- - - - - - - - - 

Generated 6x9 Board:

# C # # P % # O # 
S A T E L L I T E 
# N I N E S # T A 
% A B # A # G A S 
% D E N S E % W E 
C A T H E D R A L 

Tips [line,column]:

[1,0] SATELLITE: Used for weather forecast
[5,0] CATHEDRAL: The principal church of a city
[0,1] CANADA: Country on USA's northern border
[0,4] PLEASE: A polite way to ask things
[0,7] OTTAWA: Canada's capital
[1,2] TIBET: Dalai Lama's region
[1,8] EASEL: A tripod used to hold a painting
[2,1] NINES: Dressed up to (?)
[4,1] DENSE: Thick; impenetrable
[3,6] GAS: Type of fuel
[1,5] LS: Lori Singer, American actress
[2,7] TA: Teaching assistant (abbr.)
[3,1] AB: A blood type
[4,3] NH: New Hampshire (abbr.)
[4,5] ED: (?) Harris, American actor
[4,7] WE: First person plural (grammar)

Although this is an older question, I will attempt an answer based on similar work I have done.

There are many approaches to solving constraint problems (which are generally in the NP-complete complexity class).

This is related to combinatorial optimization and constraint programming. In this case, the constraints are the geometry of the grid, the requirement that words are unique, and so on.

Randomization/annealing approaches can also work (within the proper setting).

Efficient simplicity might just be the ultimate wisdom!

The requirements were for a more or less complete crossword compiler and (visual WYSIWYG) builder.

Leaving aside the WYSIWYG builder part, the compiler outline was this:

  1. Load the available wordlists (sorted by word length, i.e. 2, 3, ..., 20)

  2. Find the wordslots (i.e. grid words) on the user-constructed grid (e.g. a word at x,y with length L, horizontal or vertical) (complexity O(N))

  3. Compute the intersection points of the grid words (which need to be filled) (complexity O(N^2))

  4. Compute the intersections of the words in the wordlists with the various letters of the alphabet used (this allows searching for matching words using a template, e.g. the Sik Cambon thesis as used by cwc) (complexity O(WL*AL))

Steps 3 and 4 make the following possible:

a. The intersections of the grid words with each other make it possible to create a "template" for finding matches in the wordlist associated with a given grid word (using the letters of intersecting words that are already filled in at a given step of the algorithm).

b. The intersections of the words in a wordlist with the alphabet make it possible to find the candidate words that match a given "template" (e.g. 'A' in 1st place and 'B' in 3rd place, etc.).
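The word-to-alphabet index of step 4 can be sketched as a hash map from (length, position, letter) to word sets, so a template lookup becomes a set intersection. The shapes here are my own illustration, not the original implementation:

```python
from collections import defaultdict

def build_index(wordlist):
    """Map (length, position, letter) -> set of words with that letter there."""
    index = defaultdict(set)
    for word in wordlist:
        for pos, letter in enumerate(word):
            index[(len(word), pos, letter)].add(word)
    return index

def matches(index, length, template):
    """Candidate words for a slot; `template` lists the (position, letter)
    constraints imposed by already-filled crossing words."""
    sets = [index[(length, pos, letter)] for pos, letter in template]
    if not sets:
        return set()   # unconstrained slot: use the full length-sorted list instead
    return set.intersection(*sets)
```

The index is built once per wordlist, after which each template query costs only a few set intersections.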

With these data structures implemented, the algorithm used was something like this:

NOTE: if the grid and the database of words are constant, the previous steps need only be done once.

  1. The first step of the algorithm is to select an empty wordslot (grid word) at random and fill it with a candidate word from its associated wordlist (the randomization produces different solutions over consecutive executions of the algorithm) (complexity O(1) or O(N))

  2. For each still-empty wordslot (that intersects already-filled wordslots), compute a constraint ratio (this can vary; something simple is the number of available solutions at that step) and sort the empty wordslots by this ratio (complexity O(NlogN) or O(N))

  3. Loop through the empty wordslots computed in the previous step and, for each one, try a number of candidate solutions (making sure that arc-consistency is retained, i.e. the grid still has a solution after this step if this word is used) and sort them according to maximum availability for the next step (i.e. the next step has the maximum possible solutions if this word is used at that time in that place) (complexity O(N*MaxCandidatesUsed))

  4. Fill that word (mark it as filled) and go to step 2

  5. If no word is found that satisfies the criteria of step 3, try to backtrack to another candidate solution of some previous step (the criteria can vary here) (complexity O(N))

  6. If a backtrack is found, use the alternative and optionally reset any already-filled words that might need resetting (mark them as unfilled again) (complexity O(N))

  7. If no backtrack is found, then no solution can be found (at least with this configuration, initial seed, etc.)

  8. Otherwise, when all wordslots are filled, you have one solution

This algorithm does a random consistent walk of the solution tree of the problem. If at some point there is a dead end, it backtracks to a previous node and follows another route, until either a solution is found or the candidates for the various nodes are exhausted.
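In sketch form, the walk looks like this. `candidates_for(slot, filled)` stands in for the template machinery described earlier, and the slot representation is left abstract:

```python
import random

def solve(slots, candidates_for, filled=None):
    """Random consistent walk with backtracking over the slot assignments."""
    if filled is None:
        filled = {}
    empty = [s for s in slots if s not in filled]
    if not empty:
        return filled                                   # every slot filled: a solution
    # Most-constrained-first ordering (the "constraint ratio" of step 2).
    slot = min(empty, key=lambda s: len(candidates_for(s, filled)))
    options = list(candidates_for(slot, filled))
    random.shuffle(options)        # different runs produce different grids
    for word in options:
        filled[slot] = word
        result = solve(slots, candidates_for, filled)
        if result is not None:
            return result
        del filled[slot]                                # backtrack (steps 5-6)
    return None                                         # dead end: caller backtracks further
```

Because `candidates_for` only returns words consistent with the letters already on the grid, any complete assignment returned is a valid fill.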

The consistency part makes sure that a solution found is indeed a solution, while the random part produces different solutions in different executions and also gives better performance on average.

PS: All this (and more) was implemented in pure JavaScript (with parallel processing and WYSIWYG capability).

PS2: The algorithm can easily be parallelized in order to produce more than one (different) solution at the same time.

Hope this helps.