Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Searching Arabic names discard the differences between "أ" , "ا" in mysql

I am storing Arabic name in my database. In Arabic there are some letters that may written in different format such as "ا"، "أ"، "آ" it all represent the same letter. Also, "ه" ، "ة".

I need to search database for names and ignoring the differences between "ا"، "أ"، "آ" and also the differences between "ه" ، "ة".

So, for example when user enter in search box "اسامة" it should return "أسامة"، "اسامة"، "أسامه، "اسامه ... etc. Another example, "فايز"، "فائز" should return both.

How I could do that using mysql query? How I can search for similar names not the same name?

I tried Like keyword, but it not working.

select * from employee WHERE fname like "%أسامة%" and mname="علي" and lname="الجاسم"
like image 707
A. Gh Avatar asked Apr 17 '17 00:04

A. Gh


2 Answers

Update: I rewrote my answer in case someone needs a solution and stumbles to this question.

There are 3 possible solutions I know for this problem:

  1. Create a custom collation
  2. Add a normalized field
  3. Use regular expressions in queries

I wrote a tutorial showing how to apply these solutions to MySQL. I will try to summarize this steps here.

1. Create a custom collation

You can create a custom collation make MySQL deal with these characters as one. Custom collation can be added to a file called Index.xml located in the directory of charsets. The location of the file can be found by querying the information_schema with the following:

SHOW VARIABLES LIKE 'character_sets_dir';

Navigate to the directory, back up the file, open it and scroll to element <charset name=”utf8″>, add the following XML:

<charset name="utf8">
.
.
.
  <collation name="utf8_arabic_ci" id="1029">
   <rules>
     <reset>\u0627</reset> <!-- Alef 'ا' -->
     <i>\u0623</i>        <!-- Alef With Hamza Above 'أ' -->
     <i>\u0625</i>        <!-- Alef With Hamza Below 'إ' -->
     <i>\u0622</i>        <!-- Alef With Madda Above 'آ' -->
   </rules>
   <rules>
     <reset>\u0629</reset> <!-- Teh Marbuta 'ة' -->
     <i>\u0647</i>        <!-- Heh 'ه' -->
   </rules>
   <rules>
     <reset>\u0000</reset> <!-- Ignore Tashkil -->
     <i>\u064E</i>        <!-- Fatha 'َ' -->
     <i>\u064F</i>        <!-- Damma 'ُ' -->
     <i>\u0650</i>        <!-- Kasra 'ِ' -->
     <i>\u0651</i>        <!-- Shadda 'ّ' -->
     <i>\u064F</i>        <!-- Sukun 'ْ' -->
     <i>\u064B</i>        <!-- Fathatan 'ً' -->
     <i>\u064C</i>        <!-- Dammatan 'ٌ' -->
     <i>\u064D</i>        <!-- Kasratan 'ٍ' -->
   </rules>
 </collation>
</charset>

This xml simply says that this collation is one of utf8 charsets, I chose to name it utf8_arabic_ci, and chose id number of 1029, custom collations ids are in the range 1024-2047. The collation rules tells MySQL to treat all forms of Alef as the same character, and also Teh and Heh, and ignore tashkil completely. You can add more rules if you like. Refer to MySQL docuemntation for more info about custom collations.

Now restart MySQL and change the collation of the column to our new collation with a query like:

ALTER TABLE persons MODIFY name VARCHAR(50) 
CHARACTER SET 'utf8' COLLATE 'utf8_arabic_ci';

You should be able to search for 'اسامة' and get 'اسامة', 'أسامه', 'أسَامَة' ... etc.

2. Add a normalized field

This solution requires adding a new field to the table. The field will be 'normalized', this is an example of a normalized Arabic first name field:

id normalized_name name
1  احمد            احمد
2  أحمد            احمد
3  أسامه          اسامة
4  أسامة          اسامة
5  اسامه          اسامة
6  اسَامه          اسامة

This normalized field can be created by adding a new column to the table and filling it with the result of a 'normalizing' function which replaces the different variation of a character with only one and also removes Tashkil. Now to make the search query work, we will query for the normalized field and display the original field. Something like the following:

SELECT name FROM persons WHERE normalized_name = "اسامة";
 
+--------------+
| name         |
+--------------+
| أسامه        |
| أسامة        |
| اسامه        |
| اسَامه        |
+--------------+

3. Using regular expressions in queries

I don't recommend this solution, you will lose the advantage of indices, will cost you on performance, and you will have a hard time generating regex patterns. But you might find it useful for testing or special queries.

You can use REGEX or its synonym RLIKE in MySQL queries. For example, if you want to find the name 'أحمد' with any of the variations of Alef, you will use a pattern like:

SELECT name FROM clients WHERE name REGEXP 'ا|أ|إ]حمد]'

This should show the required result, all you have to do is to write a function to generate this pattern for the search string. This is an example function but keep in mind this is just an example and won't work in all cases:

// Add all your patterns and replacement in these arrays
$patterns     = array( "/(ا|أ|آ)/", "/(ه|ة)/" ); 
$replacements = array( "[ا|أ|آ]",   "[ة|ه]" );   
$query_string = preg_replace($patterns, $replacements, $search_string);

This should work for Alef, Teh and Heh, but it won't work for the Tashkil.

Conclusion

Adding a custom collation I think is the best solution for most cases, but you might not be able to edit charset files (like if you are using a shared hosting for example), adding a normalized field will be the solution in this case, and you might find regex pattters useful in somecases.

like image 92
A.Essam Avatar answered Nov 09 '22 07:11

A.Essam


The way that I would handle this is by normalizing the data you're storing in your database. Create a new field in your DB, and run a script that normalizes the names and saves the normalized version in the new field. So, "أسامة"، "اسامة"، "أسامه، "اسامه" would all saved in the normalized field as اسامه, for example, and you would run your queries on the normalized field rather than on the raw name field.

like image 5
larapsodia Avatar answered Nov 09 '22 08:11

larapsodia