I am building a word unscrambler (php/mysql) that takes user input of between 2 and 8 letters and returns words of between 2 and 8 letters that can be made from those letters, not necessarily using all of the letters, but definitely not including more letters than supplied.
The user will enter something like MSIKE or MSIKEI (two i's), or any combination of letters or multiple occurrences of a letter.
The query below will find all occurrences of words that contain M, S, I, K, or E.
However, the query below also returns words that have multiple occurrences of letters not requested. For example, the word meek would be returned, even though it has two e's and the user didn't enter two e's, or the word kiss, even though the user didn't enter s twice.
SELECT word
FROM words
WHERE word REGEXP '[msike]'
AND has_a=0
AND has_b=0
AND has_c=0
AND has_d=0
(we skip e) or we could add has_e=1
AND has_f=0
...and so on...skipping letters m, s, i, k, and e
AND has_w=0
AND has_x=0
AND has_y=0
AND has_z=0
Note the columns has_a, has_b, etc are either 1 if the letter occurs in the word or 0 if not.
I am open to any changes to the table schema.
This site: http://grecni.com/texttwist.php is a good example of what I am trying to emulate.
Question is how to modify the query to not return words with multiple occurrences of a letter, unless the user specifically entered a letter multiple times. Grouping by word length would be an added bonus.
Thanks so much.
EDIT: I altered the db per the suggestion of @awei, The has_{letter} is now count_{letter} and stores the total number of occurrences of the respective letter in the respective word. This could be useful when a user enters a letter multiple times. example: user enters MSIKES (two s).
Additionally, I have abandoned the REGEXP approach as shown in the original SQL statement. Working on doing most of the work on the PHP side, but many hurdles still in the way.
EDIT: Included first 10 rows from table
id word alpha otcwl ospd csw sowpods dictionary enable vowels consonants start_with end_with end_with_ing end_with_ly end_with_xy count_a count_b count_c count_d count_e count_f count_g count_h count_i count_j count_k count_l count_m count_n count_o count_p count_q count_r count_s count_t count_u count_v count_w count_x count_y count_z q_no_u letter_count scrabble_points wwf_points status date_added
1 aa aa 1 0 0 1 1 1 aa a a 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 2 1 2015-11-12 05:39:45
2 aah aah 1 0 0 1 0 1 aa h a h 0 0 0 2 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 6 5 1 2015-11-12 05:39:45
3 aahed aadeh 1 0 0 1 0 1 aae hd a d 0 0 0 2 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 9 8 1 2015-11-12 05:39:45
4 aahing aaghin 1 0 0 1 0 1 aai hng a g 1 0 0 2 0 0 0 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 6 10 11 1 2015-11-12 05:39:45
5 aahs aahs 1 0 0 1 0 1 aa hs a s 0 0 0 2 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 4 7 6 1 2015-11-12 05:39:45
6 aal aal 1 0 0 1 0 1 aa l a l 0 0 0 2 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 3 4 1 2015-11-12 05:39:45
7 aalii aaiil 1 0 0 1 1 1 aaii l a i 0 0 0 2 0 0 0 0 0 0 0 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 5 6 1 2015-11-12 05:39:45
8 aaliis aaiils 1 0 0 1 0 1 aaii ls a s 0 0 0 2 0 0 0 0 0 0 0 2 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 6 6 7 1 2015-11-12 05:39:45
9 aals aals 1 0 0 1 0 1 aa ls a s 0 0 0 2 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 4 4 5 1 2015-11-12 05:39:45
10 aardvark aaadkrrv 1 0 0 1 1 1 aaa rdvrk a k 0 0 0 3 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0 0 8 16 17 1 2015-11-12 05:39:45
Think you've already done the hard work with your revised schema. All you need to do now is modify the query to look for <=
the number of counts of each letter as specified by the user.
E.g. if the user entered "ALIAS":
SELECT word
FROM words
WHERE count_a <= 2
AND count_b <= 0
AND count_c <= 0
AND count_d <= 0
AND count_e <= 0
AND count_f <= 0
AND count_g <= 0
AND count_h <= 0
AND count_i <= 1
AND count_j <= 0
AND count_k <= 0
AND count_l <= 1
AND count_m <= 0
AND count_n <= 0
AND count_o <= 0
AND count_p <= 0
AND count_q <= 0
AND count_r <= 0
AND count_s <= 1
AND count_t <= 0
AND count_u <= 0
AND count_v <= 0
AND count_w <= 0
AND count_x <= 0
AND count_y <= 0
AND count_z <= 0
ORDER BY CHAR_LENGTH(word), word;
Note: As requested, this is ordering by word length, then alphabetically. Have used <=
even for <= 0
just to make it easier to modify by hand for other letters.
This returns "aa", "aal" and "aals" (but not "aalii" or "aaliis" since they both have two "i"s).
See SQL Fiddle Demo.
Since you have two different requirements, I suggest implementing both two different solutions.
Where you don't care about dup letters, build a SET
datatype with the 26 letters. Populate the bits according what the word has. This ignores duplicate letters. This also facilitates looking for words with a subset of the letters: (the_set & ~the_letters) = 0
.
Where you do care about dups, sort the letters in the word and store that as the key. "msike" becomes "eikms".
Build a table that contains 3 columns:
eikms -- non unique index on this
msike -- the real word - probably good to have this as the PRIMARY KEY
SET('m','s','i',','k','e') -- for the other situation.
msikei and meek would be entered as
eikms
msikei
SET('m','s','i',','k','e') -- (or, if more convenient: SET('m','i','s','i',','k','e')
ekm
meek
SET('e','k','m')
REGEXP
is not practical for your task.
Edit 1
I think you also need a column that indicates whether there are any doubled letters in the word. That way, you can distinguish that kiss
is allowed for msikes
but for for msike
.
Edit 2
A SET
or an INT UNSIGNED
can hold 1 bit for each of the 26 letters -- 0 for not present, 1 for present.
msikes
and msike
would both go into the set with exactly 5 bits turned on. The value to INSERT
would be 'm,s,i,k,e,s'
for msikes
. Since the rest needs to involve Boolean arithmetic, maybe it would be better to use INT UNSIGNED
. So...
a is 1 (1 << 0)
b is 2 (1 << 1)
c is 4 (1 << 2)
d is 8 (1 << 3)
...
z is (1 << 25)
To INSERT
you use the |
operator. bad
becomes
(1 << 1) | (1 << 0) | (1 << 3)
Note how the bits are laid out, with 'a' at the bottom:
SELECT BIN((1 << 1) | (1 << 0) | (1 << 3)); ==> 1011
Similarly 'ad' is 1001. So, does 'ad' match 'bad'? The answer comes from
SELECT b'1001' & ~b'1011' = 0; ==> 1 (meaning 'true')
That means that all the letters in 'ad' (1001) are found in 'bad' (1011). Let's introduce "bed", which is 11010.
SELECT b'11010' & ~b'1011' = 0; ==> FALSE because of 'e' (10000)
But 'dad' (1001) will work fine:
SELECT b'1001' & ~b'1011' = 0; ==> TRUE
So, now comes the "dup" flag. Since 'dad' has dup letters, but 'bad' did not, your rules say that it is not a match. But it took the "dup" to finish the decision.
If you have not had a course in Boolean arithmetic, well, I have just presented the first couple of chapters. If I covered it too fast, find a math book on such and jump in. "It's not rocket science."
So, back to what code is needed to decide whether my_word has a subset of letters and whether it is allowed to have duplicate letters:
SELECT $my_mask & ~tbl.mask = 0, dup FROM tbl;
Then do the suitable AND / OR between to finish the logic.
With the limited Regex support on MySQL, best I can do is a PHP script for generating the query, presuming it only includes English letters. It seems making an expression to exclude invalid words is easier than one that includes them.
<?php
$inputword = str_split('msikes');
$counter = array();
for ($l = 'a'; $l < 'z'; $l++) {
$counter[$l] = 0;
}
foreach ($inputword as $l) {
$counter[$l]++;
}
$nots = '';
foreach ($counter as $l => $c) {
if (!$c) {
$nots .= $l;
unset($counter[$l]);
}
}
$conditions = array();
if(!empty($nots)) {
// exclude words that have letters not given
$conditions[] = "[" . $nots . "]'";
}
foreach ($counter as $l => $c) {
$letters = array();
for ($i = 0; $i <= $c; $i++) {
$letters[] = $l;
}
// exclude words that have the current letter more times than given
$conditions[] = implode('.*', $letters);
}
$sql = "SELECT word FROM words WHERE word NOT RLIKE '" . implode('|', $conditions) . "'";
echo $sql;
Something like this might work for you:
// Input Word
$WORD = strtolower('msikes');
// Alpha Array
$Alpha = range('a', 'z');
// Turn it into letters.
$Splited = str_split($WORD);
$Letters = array();
// Count occurrence of each letter, use letter as key to make it unique
foreach( $Splited as $Letter ) {
$Letters[$Letter] = array_key_exists($Letter, $Letters) ? $Letters[$Letter] + 1 : 1;
}
// Build a list of letters that shouldn't be present in the word
$ShouldNotExists = array_filter($Alpha, function ($Letter) use ($Letters) {
return ! array_key_exists($Letter, $Letters);
});
#### Building SQL Statement
// Letters to skip
$SkipLetters = array();
foreach( $ShouldNotExists as $SkipLetter ) {
$SkipLetters[] = "`has_{$SkipLetter}` = 0";
}
// count condition (for multiple occurrences)
$CountLetters = array();
foreach( $Letters as $K => $V ) {
$CountLetters[] = "`count_{$K}` <= {$V}";
}
$SQL = 'SELECT `word` FROM `words` WHERE '.PHP_EOL;
$SQL .= '('.implode(' AND ', $SkipLetters).')'.PHP_EOL;
$SQL .= ' AND ('.implode(' AND ', $CountLetters).')'.PHP_EOL;
$SQL .= ' ORDER BY LENGTH(`word`), `word`'.PHP_EOL;
echo $SQL;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With