Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL query that can find Typos in Arabic language

I want to make a dictionary and I need to have a query that can find words with misspelling (I mean Typos / spelling error or typo), if the query could not find exact word then try to get with other spellings...

So what is Arabic: In Arabic there are some letter with (almost) same pronunciation but different letter, people sometimes don't know which one should use. For example there are 4 kind of Z in Arabic "ز / ظ / ذ / ض" pronunciation are different (a little) but people will forgot which one is the correct spelling. an example of one word with different using of "z" letter:

مریز / مریض / مریظ / مریذ

The correct is مریض

Here are other worlds that has more than one latter:

z: ض / ز / ذ / ظ 

T: ت / ط / 

S: ث / س / ص / 

Gh: ق / غ

So what is your idea? How should be the query?

I want if a user searched for "مریز" instead of showing him an error of 404 (not found) search the database with other letter (all Z) then return result if I found anything.

like image 843
kiokoshin Avatar asked Jan 09 '16 07:01

kiokoshin


People also ask

Does SQL support Arabic?

SQL Collation that supports both English and arabic language - Microsoft Q&A.

How do I find a specific letter in SQL?

SQL Server CHARINDEX() Function The CHARINDEX() function searches for a substring in a string, and returns the position.

How do I insert an Arabic character into an SQL database?

To insert the unicode characters in the database you have to send the text as unicode by using a parameter type like nvarchar / SqlDbType. NVarChar . (For completeness: if you are creating SQL dynamically (against common advice), you put an N before a string literal to make it unicode.

How do I match a word in SQL?

SQL pattern matching allows you to search for patterns in data if you don't know the exact word or phrase you are seeking. This kind of SQL query uses wildcard characters to match a pattern, rather than specifying it exactly. For example, you can use the wildcard "C%" to match any string beginning with a capital C.


Video Answer


3 Answers

In German, we have the same issue regarding t and tt or dt - especially in names.

One way to approach this would be to store additional normalized column containing the name / word with fixed transformation.

 tt -> t
 dt -> t
 ß  -> s
 ss -> s

So table would contain

 WORD    | NORMALIZED
 schmitt | schmit
 schmidt | schmit

At query time, apply these same transformations to the query and then compare against normalized column.

like image 136
Jan Avatar answered Oct 18 '22 23:10

Jan


There is an algorithm called Levenshtein distance (there are others as well), which tells the edit distance between two strings.

You could derive from this, try to find the most resembling words in your dictionary compared to your input.

Later you can assign weight for substitutions based on the letter tuples you mentioned to refine your search.

In fact there is an implementation for MySQL you definitely should check out: https://www.artfulsoftware.com/infotree/qrytip.php?id=552
Most of the levenshtein+mysql questions here in SO point to this page.

like image 4
Koshinae Avatar answered Oct 18 '22 23:10

Koshinae


A simpler solution would be using regular expressions within a like statement. For letters that are likely to be misspelled you can keep varieties in a regular expression wildcard. For the letters corresponding to z wildcard is "[زذظض]" You can replace all ز، ذ، ظ، ض letters with the wildcard and then query with a like statement:

select * from searched_table where word like "%[مرى[زذظض%"

After you find all versions of the searched word, you may either show the user all of them, or you may calculate the levenshtein distance (koshinae's answer) and show the closest words.

Edit: only for the letter Z, query would be like below

set @word = 'مرىض'; -- take this text from user
set @word = replace(@word, 'ذ', 'Z');
set @word = replace(@word, 'ظ', 'Z');
set @word = replace(@word, 'ض', 'Z');
set @word = replace(@word, 'ز', 'Z');
set @word = replace(@word, 'Z', '[زظضذ]');
set @word = Concat('%',  @word,  '%');
select @word;

select * from mydb.searchTable where word like @word;  
like image 2
Abdullah Nehir Avatar answered Oct 18 '22 21:10

Abdullah Nehir