This is a twofold question, because I'm out of ideas on how to implement this most efficiently.
I have a dictionary of 150,000 words stored in a Trie implementation.
A user is provided with two words. The goal is to find the shortest path of other English words (changed by one character apiece) from the start word to the end word.
For example:
Start: Dog
End: Cat
Path: Dog, Dot, Cot, Cat
Path: Dog, Cog, Log, Bog, Bot, Cot, Cat
Path: Dog, Doe, Joe, Joy, Jot, Cot, Cat
My current implementation has gone through several iterations; the simplest version I can give pseudocode for (as the actual code spans several files) is:
var start = "dog";
var end = "cat";
var alphabet = "abcdefghijklmnopqrstuvwxyz";
var possible_words = [];

for (var letter_of_word = 0; letter_of_word < start.length; letter_of_word++) {
    for (var letter_of_alphabet = 0; letter_of_alphabet < alphabet.length; letter_of_alphabet++) {
        // strings are immutable, so build a new word around the replaced character
        var new_word = start.substring(0, letter_of_word)
                     + alphabet[letter_of_alphabet]
                     + start.substring(letter_of_word + 1);
        if (in_dictionary(new_word)) {
            possible_words.push(new_word);
        }
    }
}

function bfs() {
    var q = [];
    // ... usual bfs implementation here ...
}
Knowns:
My issue is that I do not have an efficient way of determining a potential word to try without brute-forcing the alphabet and checking each new word against the dictionary. I know there is probably a more efficient way using prefixes, but I can't figure out a proper implementation, or one that doesn't just double the processing.
Secondly, should I be using a different search algorithm? I've looked at A* and best-first search as possibilities, but those require edge weights, which I don't have.
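(For context on the weights question: every edit costs the same, so plain BFS already returns a shortest ladder. A* only pays off if you add an admissible heuristic, e.g. the number of positions where the current word differs from the target. A minimal sketch of that heuristic, with a hypothetical helper name:)

```cpp
#include <cstddef>
#include <string>

// Admissible A* heuristic for same-length words: each single-character
// edit can fix at most one differing position, so the count of
// mismatched letters never overestimates the remaining path length.
std::size_t hamming(const std::string& a, const std::string& b)
{
    std::size_t diff = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] != b[i]) ++diff;
    return diff;
}
```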
Thoughts?
As requested in the comments, here's an illustration of what I mean by encoding linked words in the bits of integers.
In C++, it might look something like...
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// populate a list of known words (or read from file etc)...
std::vector<std::string> words = {
    "dog", "dot", "cot", "cat", "log", "bog"
};

// create sets of one-letter-apart words...
std::unordered_map<std::string, int32_t> links;
for (auto& word : words)
    for (size_t i = 0; i < word.size(); ++i)
    {
        char save = word[i];
        word[i] = '_';
        links[word] |= 1 << (save - 'a');
        word[i] = save;
    }
After the above code runs, links[x] - where x is a word with one letter replaced by an underscore, a la d_g - retrieves an integer indicating the letters that can replace the underscore to form known words. If the least significant bit is set, then 'dag' is a known word; if the next-from-least-significant bit is set, then 'dbg' is a known word, and so on.
Intuitively, I'd expect using integers to reduce the overall memory used for linkage data, but if the majority of words only have a couple of linked words each, storing an index or pointer to those words may actually use less memory - and be easier if you're not used to bitwise manipulation, i.e.:
std::unordered_map<std::string, std::vector<const char*>> links;
for (auto& word : words)
    for (size_t i = 0; i < word.size(); ++i)
    {
        char save = word[i];
        word[i] = '_';
        // word[i] is restored below, so the stored pointer ends up
        // referring to the original (un-underscored) word's text
        links[word].push_back(word.c_str());
        word[i] = save;
    }
Either way, you then have a graph linking each word to those it can transform into with single-character changes. You can then apply the logic of Dijkstra's algorithm to find a shortest path between any two words.
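As a rough sketch of that search (since all edges have equal weight, Dijkstra's algorithm degenerates to a plain BFS; I'm storing strings rather than raw pointers in the map for simplicity, and the shortest_ladder name is mine):

```cpp
#include <cstddef>
#include <queue>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// BFS from start to end over a pattern -> words map; returns the
// shortest ladder (inclusive of both endpoints), or an empty vector
// if no ladder exists.
std::vector<std::string> shortest_ladder(
    const std::string& start, const std::string& end,
    const std::unordered_map<std::string, std::vector<std::string>>& links)
{
    std::unordered_map<std::string, std::string> parent;
    std::unordered_set<std::string> seen{start};
    std::queue<std::string> q;
    q.push(start);
    while (!q.empty())
    {
        std::string word = q.front(); q.pop();
        if (word == end)
        {
            // walk the parent chain back to start
            std::vector<std::string> path;
            for (std::string w = end; ; w = parent[w])
            {
                path.insert(path.begin(), w);
                if (w == start) break;
            }
            return path;
        }
        for (std::size_t i = 0; i < word.size(); ++i)
        {
            std::string pattern = word;
            pattern[i] = '_';
            auto it = links.find(pattern);
            if (it == links.end()) continue;
            for (const auto& next : it->second)
                if (seen.insert(next).second)
                {
                    parent[next] = word;
                    q.push(next);
                }
        }
    }
    return {};
}
```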
Just to add an update for those who starred this question: I've added a GitHub repository with a JavaScript implementation of this particular data structure.
https://github.com/acupajoe/Lexibit.js
Thank you all for the help and ideas!