
How do I make my word unscrambler return more relevant results

Tags: regex, php, mysql

I am building a word unscrambler (php/mysql) that takes user input of between 2 and 8 letters and returns words of between 2 and 8 letters that can be made from those letters, not necessarily using all of the letters, but definitely not including more letters than supplied.

The user will enter something like MSIKE or MSIKEI (two i's), or any combination of letters or multiple occurrences of a letter.

The query below will find all occurrences of words that contain M, S, I, K, or E.

However, the query below also returns words that have multiple occurrences of letters not requested. For example, the word meek would be returned, even though it has two e's and the user didn't enter two e's, or the word kiss, even though the user didn't enter s twice.

SELECT word
FROM words
WHERE word REGEXP '[msike]'
AND has_a=0
AND has_b=0
AND has_c=0
AND has_d=0
-- (we skip e) or we could add has_e=1
AND has_f=0
-- ...and so on, skipping the letters m, s, i, k, and e
AND has_w=0
AND has_x=0
AND has_y=0
AND has_z=0

Note the columns has_a, has_b, etc are either 1 if the letter occurs in the word or 0 if not.

I am open to any changes to the table schema.

This site: http://grecni.com/texttwist.php is a good example of what I am trying to emulate.

The question is how to modify the query so that it does not return words with multiple occurrences of a letter, unless the user specifically entered that letter multiple times. Grouping by word length would be an added bonus.

Thanks so much.


EDIT: I altered the db per the suggestion of @awei. The has_{letter} column is now count_{letter} and stores the total number of occurrences of the respective letter in the respective word. This could be useful when a user enters a letter multiple times. Example: the user enters MSIKES (two s's).

Additionally, I have abandoned the REGEXP approach shown in the original SQL statement. I am now trying to do most of the work on the PHP side, but many hurdles are still in the way.


EDIT: Included first 10 rows from table

id  word        alpha       otcwl   ospd    csw sowpods dictionary  enable  vowels  consonants  start_with  end_with    end_with_ing    end_with_ly end_with_xy count_a count_b count_c count_d count_e count_f count_g count_h count_i count_j count_k count_l count_m count_n count_o count_p count_q count_r count_s count_t count_u count_v count_w count_x count_y count_z q_no_u  letter_count    scrabble_points wwf_points  status  date_added  
1   aa          aa          1       0       0   1       1           1       aa                  a           a           0               0           0           2       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       2               2               2           1       2015-11-12 05:39:45
2   aah         aah         1       0       0   1       0           1       aa      h           a           h           0               0           0           2       0       0       0       0       0       0       1       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       3               6               5           1       2015-11-12 05:39:45
3   aahed       aadeh       1       0       0   1       0           1       aae     hd          a           d           0               0           0           2       0       0       1       1       0       0       1       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       5               9               8           1       2015-11-12 05:39:45
4   aahing      aaghin      1       0       0   1       0           1       aai     hng         a           g           1               0           0           2       0       0       0       0       0       1       1       1       0       0       0       0       1       0       0       0       0       0       0       0       0       0       0       0       0       0       6               10              11          1       2015-11-12 05:39:45
5   aahs        aahs        1       0       0   1       0           1       aa      hs          a           s           0               0           0           2       0       0       0       0       0       0       1       0       0       0       0       0       0       0       0       0       0       1       0       0       0       0       0       0       0       0       4               7               6           1       2015-11-12 05:39:45
6   aal         aal         1       0       0   1       0           1       aa      l           a           l           0               0           0           2       0       0       0       0       0       0       0       0       0       0       1       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       3               3               4           1       2015-11-12 05:39:45
7   aalii       aaiil       1       0       0   1       1           1       aaii    l           a           i           0               0           0           2       0       0       0       0       0       0       0       2       0       0       1       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       5               5               6           1       2015-11-12 05:39:45
8   aaliis      aaiils      1       0       0   1       0           1       aaii    ls          a           s           0               0           0           2       0       0       0       0       0       0       0       2       0       0       1       0       0       0       0       0       0       1       0       0       0       0       0       0       0       0       6               6               7           1       2015-11-12 05:39:45
9   aals        aals        1       0       0   1       0           1       aa      ls          a           s           0               0           0           2       0       0       0       0       0       0       0       0       0       0       1       0       0       0       0       0       0       1       0       0       0       0       0       0       0       0       4               4               5           1       2015-11-12 05:39:45
10  aardvark    aaadkrrv    1       0       0   1       1           1       aaa     rdvrk       a           k           0               0           0           3       0       0       1       0       0       0       0       0       0       1       0       0       0       0       0       0       2       0       0       0       1       0       0       0       0       0       8               16              17          1       2015-11-12 05:39:45

Mark asked Jan 08 '16 02:01


4 Answers

I think you've already done the hard work with your revised schema. All you need to do now is modify the query so that each letter's count is <= the number of times the user supplied that letter.

E.g. if the user entered "ALIAS":

SELECT word
FROM words
WHERE count_a <= 2
  AND count_b <= 0
  AND count_c <= 0
  AND count_d <= 0
  AND count_e <= 0
  AND count_f <= 0
  AND count_g <= 0
  AND count_h <= 0
  AND count_i <= 1
  AND count_j <= 0
  AND count_k <= 0
  AND count_l <= 1
  AND count_m <= 0
  AND count_n <= 0
  AND count_o <= 0
  AND count_p <= 0
  AND count_q <= 0
  AND count_r <= 0
  AND count_s <= 1
  AND count_t <= 0
  AND count_u <= 0
  AND count_v <= 0
  AND count_w <= 0
  AND count_x <= 0
  AND count_y <= 0
  AND count_z <= 0
ORDER BY CHAR_LENGTH(word), word;

Note: As requested, this orders by word length, then alphabetically. I have used <= even for the zero counts just to make the query easier to modify by hand for other letters.

This returns "aa", "aal" and "aals" (but not "aalii" or "aaliis" since they both have two "i"s).

See SQL Fiddle Demo.
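
The same per-letter rule can be checked outside the database, which may help when testing the query. The following is only a minimal PHP sketch: the function name can_build and the sample word list (taken from the first rows of the question's table) are illustrative assumptions, not part of the answer.

```php
<?php
// Hypothetical helper: true when $word can be spelled from $letters without
// using any letter more often than it was supplied. This mirrors the
// "count_x <= N" conditions in the query above.
function can_build(string $word, string $letters): bool
{
    // count_chars(..., 1) returns [byte value => occurrences] for bytes that occur.
    $available = count_chars(strtolower($letters), 1);
    foreach (count_chars(strtolower($word), 1) as $byte => $needed) {
        if ($needed > ($available[$byte] ?? 0)) {
            return false;
        }
    }
    return true;
}

$sample = ['aa', 'aah', 'aal', 'aalii', 'aaliis', 'aals'];
$matches = array_values(array_filter($sample, function ($w) {
    return can_build($w, 'alias');
}));
echo implode(', ', $matches), PHP_EOL; // aa, aal, aals
```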

Steve Chambers answered Oct 20 '22 11:10


Since you have two different requirements, I suggest implementing two different solutions.

Where you don't care about dup letters, build a SET datatype with the 26 letters. Populate the bits according to which letters the word has. This ignores duplicate letters. This also facilitates looking for words with a subset of the letters: (the_set & ~the_letters) = 0.

Where you do care about dups, sort the letters in the word and store that as the key. "msike" becomes "eikms".

Build a table that contains 3 columns:

eikms -- non unique index on this
msike -- the real word - probably good to have this as the PRIMARY KEY
SET('m','s','i','k','e') -- for the other situation.

msikei and meek would be entered as

eiikms
msikei
SET('m','s','i','k','e') -- (or, if more convenient: SET('m','i','s','i','k','e'))

eekm
meek
SET('e','k','m')
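
As a rough illustration of how the sorted-letter key could be produced on the PHP side, here is a minimal sketch. The function name sorted_key is a hypothetical choice; the second example word comes from the alpha column of the question's sample rows.

```php
<?php
// Hypothetical helper: sorts the letters of a word to build the lookup key.
// Duplicate letters are kept, so the key still reflects how often each occurs.
function sorted_key(string $word): string
{
    $letters = str_split(strtolower($word));
    sort($letters, SORT_STRING);
    return implode('', $letters);
}

echo sorted_key('msike'), PHP_EOL;    // eikms
echo sorted_key('aardvark'), PHP_EOL; // aaadkrrv
```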

REGEXP is not practical for your task.

Edit 1

I think you also need a column that indicates whether there are any doubled letters in the word. That way, you can tell that kiss is allowed for msikes but not for msike.

Edit 2

A SET or an INT UNSIGNED can hold 1 bit for each of the 26 letters -- 0 for not present, 1 for present.

msikes and msike would both go into the set with exactly 5 bits turned on. The value to INSERT would be 'm,s,i,k,e,s' for msikes. Since the rest needs to involve Boolean arithmetic, maybe it would be better to use INT UNSIGNED. So...

a is 1 (1 << 0)
b is 2 (1 << 1)
c is 4 (1 << 2)
d is 8 (1 << 3)
...
z is (1 << 25)

To INSERT you use the | operator. bad becomes

(1 << 1) | (1 << 0) | (1 << 3)

Note how the bits are laid out, with 'a' at the bottom:

SELECT BIN((1 << 1) | (1 << 0) | (1 << 3)); ==> 1011

Similarly 'ad' is 1001. So, does 'ad' match 'bad'? The answer comes from

SELECT b'1001' & ~b'1011' = 0; ==> 1 (meaning 'true')

That means that all the letters in 'ad' (1001) are found in 'bad' (1011). Let's introduce "bed", which is 11010.

SELECT b'11010' & ~b'1011' = 0; ==> FALSE because of 'e' (10000)

But 'dad' (1001) will work fine:

SELECT b'1001' & ~b'1011' = 0; ==> TRUE
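
The mask arithmetic above can be reproduced in PHP when the masks are computed before inserting rows or building a query. This is a sketch under assumptions: the helper names letter_mask and is_letter_subset are hypothetical, and only the letters a to z are considered.

```php
<?php
// Hypothetical helper: 26-bit mask with bit 0 for 'a' up to bit 25 for 'z'.
function letter_mask(string $word): int
{
    $mask = 0;
    foreach (str_split(strtolower($word)) as $ch) {
        $bit = ord($ch) - ord('a');
        if ($bit >= 0 && $bit < 26) { // ignore anything outside a-z
            $mask |= 1 << $bit;
        }
    }
    return $mask;
}

// True when every letter of $candidate also appears in $letters.
// Duplicates are ignored, exactly as in the SQL examples above.
function is_letter_subset(string $candidate, string $letters): bool
{
    return (letter_mask($candidate) & ~letter_mask($letters)) === 0;
}

echo decbin(letter_mask('bad')), PHP_EOL;  // 1011
var_dump(is_letter_subset('ad', 'bad'));   // bool(true)
var_dump(is_letter_subset('bed', 'bad'));  // bool(false)
var_dump(is_letter_subset('dad', 'bad'));  // bool(true)
```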

So, now comes the "dup" flag. Since 'dad' has dup letters, but 'bad' did not, your rules say that it is not a match. But it took the "dup" to finish the decision.

If you have not had a course in Boolean arithmetic, well, I have just presented the first couple of chapters. If I covered it too fast, find a math book on such and jump in. "It's not rocket science."

So, back to what code is needed to decide whether a word in the table uses only a subset of the letters in my_word, and whether it is allowed to have duplicate letters:

SELECT (tbl.mask & ~$my_mask) = 0, dup FROM tbl;

Then apply the suitable AND / OR between the two results to finish the logic.
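
One possible reading of that final AND / OR step, written out in PHP so it can be tried without a database. All three function names are hypothetical, and the combination rule shown (a word with doubled letters is accepted only if the input has doubled letters too) is an assumption about what the answer intends.

```php
<?php
// Hypothetical helper: 26-bit mask, bit 0 = 'a', letters outside a-z ignored.
function letter_mask(string $word): int
{
    $mask = 0;
    foreach (str_split(strtolower($word)) as $ch) {
        $bit = ord($ch) - ord('a');
        if ($bit >= 0 && $bit < 26) {
            $mask |= 1 << $bit;
        }
    }
    return $mask;
}

// Hypothetical "dup" flag: fewer distinct bytes than characters means a repeat.
function has_dup(string $word): bool
{
    return count(count_chars($word, 1)) < strlen($word);
}

// Subset test AND (word has no doubled letter OR the input has one too).
function is_candidate(string $word, string $input): bool
{
    $subset = (letter_mask($word) & ~letter_mask($input)) === 0;
    return $subset && (!has_dup($word) || has_dup($input));
}

var_dump(is_candidate('kiss', 'msike'));  // bool(false)
var_dump(is_candidate('kiss', 'msikes')); // bool(true)
var_dump(is_candidate('meek', 'msikes')); // bool(true), although meek needs two e's
```

The last line shows the limit of a single flag: it cannot say which letter is doubled, so a per-letter count check may still be needed on the rows that pass.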

Rick James answered Oct 20 '22 11:10


With the limited regex support in MySQL, the best I can do is a PHP script that generates the query, presuming the input only includes English letters. It seems that making an expression to exclude invalid words is easier than making one that includes valid ones.

<?php
// Letters supplied by the user (hard-coded here for the example).
$inputword = str_split('msikes');
// Start every letter at zero. range() is used because incrementing 'z'
// gives 'aa', so a for loop over letters cannot safely include 'z'.
$counter = array();
foreach (range('a', 'z') as $l) {
    $counter[$l] = 0;
}
// Count how many times each letter was supplied.
foreach ($inputword as $l) {
    $counter[$l]++;
}
// Collect the letters that were not supplied and drop them from the counter.
$nots = '';
foreach ($counter as $l => $c) {
    if (!$c) {
        $nots .= $l;
        unset($counter[$l]);
    }
}
$conditions = array();
if (!empty($nots)) {
    // exclude words that have letters not given
    $conditions[] = "[" . $nots . "]";
}
foreach ($counter as $l => $c) {
    $letters = array();
    for ($i = 0; $i <= $c; $i++) {
        $letters[] = $l;
    }
    // exclude words that have the current letter more times than given
    $conditions[] = implode('.*', $letters);
}
$sql = "SELECT word FROM words WHERE word NOT RLIKE '" . implode('|', $conditions) . "'";
echo $sql;
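
To see which words such a pattern rejects without running the query, the same exclusion pattern can be tried locally with preg_match. This is a minimal sketch: the function name exclusion_pattern and the sample words are hypothetical, and it assumes that PCRE and MySQL's RLIKE treat this simple pattern (a character class, `.*` and `|`) the same way.

```php
<?php
// Hypothetical helper: rebuilds the exclusion pattern produced by the script above.
function exclusion_pattern(string $input): string
{
    $counter = array_fill_keys(range('a', 'z'), 0);
    foreach (str_split(strtolower($input)) as $l) {
        if (isset($counter[$l])) {
            $counter[$l]++;
        }
    }
    // Letters that were never supplied go into one character class.
    $nots = implode('', array_keys($counter, 0, true));
    $conditions = $nots === '' ? [] : ['[' . $nots . ']'];
    foreach (array_filter($counter) as $l => $c) {
        // c + 1 copies of the letter: one more occurrence than was supplied.
        $conditions[] = implode('.*', array_fill(0, $c + 1, $l));
    }
    return implode('|', $conditions);
}

$pattern = '/' . exclusion_pattern('msikes') . '/';
foreach (['kiss', 'meek', 'skim', 'mask'] as $word) {
    echo $word, ': ', preg_match($pattern, $word) ? 'excluded' : 'kept', PHP_EOL;
}
// kiss: kept
// meek: excluded
// skim: kept
// mask: excluded
```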

coladict answered Oct 20 '22 10:10


Something like this might work for you:

// Input Word
$WORD = strtolower('msikes');

// Alpha Array
$Alpha = range('a', 'z');

// Turn it into letters.
$Splited    = str_split($WORD);
$Letters    = array();
// Count occurrence of each letter, use letter as key to make it unique
foreach( $Splited as $Letter ) {
    $Letters[$Letter] = array_key_exists($Letter, $Letters) ? $Letters[$Letter] + 1 : 1;
}

// Build a list of letters that shouldn't be present in the word
$ShouldNotExists = array_filter($Alpha, function ($Letter) use ($Letters) {
    return ! array_key_exists($Letter, $Letters);
});

#### Building SQL Statement
// Letters to skip
$SkipLetters = array();
foreach( $ShouldNotExists as $SkipLetter ) {
    // The has_{letter} columns were replaced by count_{letter} in the revised schema.
    $SkipLetters[] = "`count_{$SkipLetter}` = 0";
}
// count condition (for multiple occurrences)
$CountLetters = array();
foreach( $Letters as $K => $V ) {
    $CountLetters[] = "`count_{$K}` <= {$V}";
}

$SQL = 'SELECT `word` FROM `words` WHERE '.PHP_EOL;
$SQL .= '('.implode(' AND ', $SkipLetters).')'.PHP_EOL;
$SQL .= ' AND ('.implode(' AND ', $CountLetters).')'.PHP_EOL;
$SQL .= ' ORDER BY LENGTH(`word`), `word`'.PHP_EOL;

echo $SQL;

ahmad answered Oct 20 '22 11:10