Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Word list generation (sorting, optimization)

A little background:

I'm constructing lists of words for a psychology experiment. We're trying to create a chain of words such that words adjacent in the list are related, but all other words in the list are not related. So for example:

SCHOOL, CAFETERIA, PIZZA, CRUST, EARTH, OCEAN, WHALE, ...

So here we see the first word is related to the second, and the second is related to the third, but the third isn't related to the first. (And the first isn't related to the fourth, fifth, sixth, ... either)

What I have so far...

I have a list of 1600 words such that each number from 0 to 1600 corresponds to a word. I also have a very large matrix (1600 x 1600) that tells me (on a scale of 0 to 1) how related each word is to every other word. (These are from a latent semantic analysis; http://lsa.colorado.edu/)

I can make the lists, but it's not very efficient at all, and my adjacent words aren't super strongly related to each other.

Here's my basic algorithm:

  • Set thresholds for minimum value for how related the adjacent words must be and for how unrelated the non-adjacent words must be.
  • Create a list from 0 to 1600. Shuffle that list. The first item of the list will be our first word.
  • Loop through our words, checking one by one if the word meets our thresholds (i.e., check that this new word is high enough related to the last added word in the list, loop through our list and check that it's unrelated to all other words and that it isn't already in our list). If it meets the criteria, add it to the list. If we loop through all words with no success, dump the list and start all over.
  • Continue this until the list has as many words as I want (ideally, 16).

Does anyone have a better approach? The problem with my approach is that I'll sometimes settle for an okay match that meets my criteria when a better match is potentially still out there. Also, it would be nice if I didn't have to dump the whole list but could backtrack a few steps to where the list potentially went wrong.

like image 843
user2073725 Avatar asked Oct 30 '17 20:10

user2073725


1 Answers

This might be a good candidate for a genetic algorithm. You can create a large number of completely random possibilities, score each one with an objective function, and then iterate the population by crossing over mates based on fitness (possibly throwing some mutations in as well).

If done properly, this should give you a large-ish population of good solutions. If the population is large enough, the fitness function defined well enough and mutation is sufficient to get you out of any valleys you might otherwise get stuck in, you might even converge overwhelmingly on the optimal answer.

like image 165
Patrick87 Avatar answered Sep 18 '22 12:09

Patrick87