I want to make a dictionary and I need to have a query that can find words with misspelling (I mean Typos / spelling error or typo), if the query could not find exact word then try to get with other spellings...
So what is Arabic: In Arabic there are some letter with (almost) same pronunciation but different letter, people sometimes don't know which one should use. For example there are 4 kind of Z in Arabic "ز / ظ / ذ / ض" pronunciation are different (a little) but people will forgot which one is the correct spelling. an example of one word with different using of "z" letter:
مریز / مریض / مریظ / مریذ
The correct is مریض
Here are other worlds that has more than one latter:
z: ض / ز / ذ / ظ
T: ت / ط /
S: ث / س / ص /
Gh: ق / غ
So what is your idea? How should be the query?
I want if a user searched for "مریز" instead of showing him an error of 404 (not found) search the database with other letter (all Z) then return result if I found anything.
SQL Collation that supports both English and arabic language - Microsoft Q&A.
SQL Server CHARINDEX() Function The CHARINDEX() function searches for a substring in a string, and returns the position.
To insert the unicode characters in the database you have to send the text as unicode by using a parameter type like nvarchar / SqlDbType. NVarChar . (For completeness: if you are creating SQL dynamically (against common advice), you put an N before a string literal to make it unicode.
SQL pattern matching allows you to search for patterns in data if you don't know the exact word or phrase you are seeking. This kind of SQL query uses wildcard characters to match a pattern, rather than specifying it exactly. For example, you can use the wildcard "C%" to match any string beginning with a capital C.
In German, we have the same issue regarding t and tt or dt - especially in names.
One way to approach this would be to store additional normalized column containing the name / word with fixed transformation.
tt -> t
dt -> t
ß -> s
ss -> s
So table would contain
WORD | NORMALIZED
schmitt | schmit
schmidt | schmit
At query time, apply these same transformations to the query and then compare against normalized column.
There is an algorithm called Levenshtein distance (there are others as well), which tells the edit distance between two strings.
You could derive from this, try to find the most resembling words in your dictionary compared to your input.
Later you can assign weight for substitutions based on the letter tuples you mentioned to refine your search.
In fact there is an implementation for MySQL you definitely should check out: https://www.artfulsoftware.com/infotree/qrytip.php?id=552
Most of the levenshtein+mysql questions here in SO point to this page.
A simpler solution would be using regular expressions within a like statement. For letters that are likely to be misspelled you can keep varieties in a regular expression wildcard. For the letters corresponding to z wildcard is "[زذظض]" You can replace all ز، ذ، ظ، ض letters with the wildcard and then query with a like statement:
select * from searched_table where word like "%[مرى[زذظض%"
After you find all versions of the searched word, you may either show the user all of them, or you may calculate the levenshtein distance (koshinae's answer) and show the closest words.
Edit: only for the letter Z, query would be like below
set @word = 'مرىض'; -- take this text from user
set @word = replace(@word, 'ذ', 'Z');
set @word = replace(@word, 'ظ', 'Z');
set @word = replace(@word, 'ض', 'Z');
set @word = replace(@word, 'ز', 'Z');
set @word = replace(@word, 'Z', '[زظضذ]');
set @word = Concat('%', @word, '%');
select @word;
select * from mydb.searchTable where word like @word;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With