
Regex match Arabic keyword

I have a simple regex that finds a word in a text:

var patern = new RegExp("\\bsomething\\b", "gi");

This matches the word when it has spaces or punctuation around it.

So it matches:

I have something.

But doesn't match:

I havesomething.

which is fine and exactly what I need.

But I have an issue with other languages, for example Arabic. If I have this regex:

var patern = new RegExp("\\bرياضة\\b", "gi");

and text:

رياضة أنا أحب رياضتي وأنا سعيد حقا هنا لها حبي 

The keyword which I am looking for is at the end of the text.

But this doesn't work; it just doesn't find it.

It works if I remove \b from the regex:

var patern = new RegExp("رياضة", "gi");

But that is not what I want, because I don't want to find it when it is part of another word, as in the English example above:

 I havesomething.

I don't know much about regex, so I would appreciate any help making this work with both English and languages like Arabic.

carpics asked Nov 21 '16 23:11


3 Answers

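First we have to understand what \b means:

\b is an anchor that matches at a position called a "word boundary". In JavaScript, that is a position between a word character (\w, i.e. [A-Za-z0-9_]) and a non-word character, so Arabic letters never count as word characters and \b never matches around an Arabic word.

In your case, the boundary you are looking for is a position that has no other Arabic letter next to it.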
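A quick way to see this, as a minimal sketch (the variable names are illustrative, not from the question): \w does not accept an Arabic letter, so \b has nothing to anchor on in Arabic-only text.

```javascript
// \w in JavaScript covers only ASCII letters, digits, and underscore.
const isWordCharLatin = /\w/.test("s");
const isWordCharArabic = /\w/.test("ر");

// \b needs a word character on exactly one side of the position.
// Around an Arabic word both sides are "non-word", so \b never matches there.
const englishHit = /\bsomething\b/.test("I have something.");
const arabicHit = /\bرياضة\b/.test("أنا أحب رياضة");

console.log(isWordCharLatin, isWordCharArabic, englishHit, arabicHit);
// true false true false
```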

To match only Arabic letters in a regex, we can use a Unicode range:

[\u0621-\u064A]+

Or we can simply use the Arabic letters directly:

[ء-ي]+

The pattern above matches one or more Arabic letters. To make a word boundary out of it, we can simply negate it on both sides:

[^ء-ي]ARABIC TEXT[^ء-ي]

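The pattern above means: require a non-Arabic character on each side of the Arabic word, which will work in your case. Note that each negated class consumes one character, so the word must have a non-Arabic character (such as a space) before and after it, even at the start or end of the string.

Consider this example, based on the one you gave and modified a little: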
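One possible way to get the same effect without consuming the neighbouring characters, so that it also works at the very start or end of the string, is to use lookarounds instead of negated classes. This is only a sketch and not part of the original answer: `arabicWordRegex` is a hypothetical helper name, and the lookbehind `(?<!...)` assumes an ES2018-capable engine.

```javascript
// Hypothetical helper: builds a regex that matches `word` only when it is
// not directly preceded or followed by an Arabic letter (U+0621 to U+064A).
function arabicWordRegex(word) {
  // Escape regex metacharacters in case the keyword contains any.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const letter = "[\\u0621-\\u064A]";
  return new RegExp("(?<!" + letter + ")" + escaped + "(?!" + letter + ")", "g");
}

// No padding spaces are needed around the text here.
const sampleText = "أنا أحب رياضتي رياض رياضة رياضيات";
const matches = sampleText.match(arabicWordRegex("رياض")) || [];
console.log(matches.length); // 1
```

Only the standalone word is counted; the occurrences inside longer words are skipped because an Arabic letter follows them.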

 أنا أحب رياضتي رياض رياضة رياضيات وأنا سعيد حقا هنا 

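If we search for رياض alone, the search will also match inside رياضة, رياضيات, and رياضتي. However, with the pattern above, only the standalone رياض matches.

// The string is padded with a space at each end so that the negated classes
// have a character to match around the first and last word.
var x = " أنا أحب رياضتي رياض رياضة رياضيات وأنا سعيد حقا هنا ";
// $1 includes the non-Arabic character on each side of the match.
x = x.replace(/([^ء-ي]رياض[^ء-ي])/g, '<span style="color:red">$1</span>');
document.write(x);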

If you would like to account for all the alef forms أآإا with one pattern, you could use something like [\u0622\u0623\u0625\u0627] or simply list them between square brackets: [أآإا]. Here is a complete example:

var x = "أنا هنا وانا هناك .. آنا هنا وإنا هناك";
x = x.replace(/([أآإا]نا)/g, '<span style="color:red">$1</span>');
document.write (x);
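The same idea can be applied to an arbitrary keyword. The sketch below assumes a hypothetical helper, `alefInsensitive`, which is not part of the original answer; it swaps every alef form in the keyword for the class shown above.

```javascript
// Hypothetical helper: replaces each alef form in the keyword with a class
// that accepts any of the four forms (U+0622, U+0623, U+0625, U+0627).
function alefInsensitive(word) {
  return word.replace(
    /[\u0622\u0623\u0625\u0627]/g,
    "[\\u0622\\u0623\\u0625\\u0627]"
  );
}

// The keyword is written with plain alefs; the result is a regex source string.
const alefSource = alefInsensitive("انا");
const alefRegex = new RegExp(alefSource, "g");
const found = "أنا هنا وانا هناك .. آنا هنا وإنا هناك".match(alefRegex) || [];
console.log(found);
```

Note that this sketch does not add any word-boundary check; it only handles the alef forms.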

Note: If you want to match the common Arabic characters in a regex, including all Arabic letters أ ب ت ث ج, all diacritics َ ً ُ ٌ ِ ٍ ّ, and all Arabic numbers ١٢٣٤٥٦٧٨٩٠, use this regex, which covers the range U+060C to U+0669: [،-٩]+

Useful link about the ordering of Arabic characters in Unicode: https://en.wikipedia.org/wiki/Arabic_script_in_Unicode
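Since the question asks for something that works for English and Arabic alike, one possible generalisation (not from the original answer) is to use Unicode property escapes, which need the `u` flag and an ES2018-capable engine. `wholeWordRegex` is a hypothetical helper name, and the choice of letters, marks, digits, and underscore as "word characters" is an assumption.

```javascript
// Hypothetical helper: treats any Unicode letter, combining mark, digit,
// or underscore as a word character, then requires that the keyword is
// not directly adjacent to one.
function wholeWordRegex(word) {
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const wordChar = "[\\p{L}\\p{M}\\p{N}_]";
  return new RegExp("(?<!" + wordChar + ")" + escaped + "(?!" + wordChar + ")", "giu");
}

const checks = [
  wholeWordRegex("something").test("I have something."),
  wholeWordRegex("something").test("I havesomething."),
  wholeWordRegex("رياضة").test("أنا أحب رياضة"),
  wholeWordRegex("رياض").test("أنا أحب رياضة"),
];
console.log(checks); // [ true, false, true, false ]
```

Because combining marks count as word characters here, a keyword that is followed directly by a diacritic in the text will not match; whether that is desirable depends on the data.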

Ibrahim answered Oct 19 '22 16:10


This doesn't work because \b in the JavaScript regex engine only recognizes ASCII letters, digits, and underscore as word characters, so Arabic is not supported. You could instead search for the Unicode characters in the text (Unicode ranges).

Or you could convert the text to Unicode escapes and then build the regex from them somehow (I have never tried this, but it should work).
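A quick check of the escape idea, as a sketch only: `toUnicodeEscapes` is a hypothetical helper, not an existing API. Escaping describes exactly the same characters, so a plain search behaves the same and \b still finds no boundary; a range-based boundary is still needed.

```javascript
// Hypothetical helper: turns each UTF-16 code unit into a \uXXXX escape.
function toUnicodeEscapes(str) {
  return str
    .split("")
    .map(unit => "\\u" + unit.charCodeAt(0).toString(16).padStart(4, "0"))
    .join("");
}

const escapedWord = toUnicodeEscapes("رياضة");
const haystack = "أنا أحب رياضة";

// Plain search with the escaped source: same result as the literal word.
const plainMatch = new RegExp(escapedWord).test(haystack);
// Adding \b still fails, because \b does not treat Arabic as word characters.
const boundaryMatch = new RegExp("\\b" + escapedWord + "\\b").test(haystack);

console.log(escapedWord, plainMatch, boundaryMatch);
```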

german meza answered Oct 19 '22 17:10


I used ء-ي٠-٩ inside a character class, and it works for me.
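One way these ranges could be used, as a sketch that assumes the goal is an exact whole-word check: split the text into runs of Arabic letters and Arabic-Indic digits, then compare the runs with the keyword. The variable names are illustrative.

```javascript
// Runs of Arabic letters (U+0621 to U+064A) and Arabic-Indic digits (U+0660 to U+0669).
const tokenPattern = /[\u0621-\u064A\u0660-\u0669]+/g;

const tokens = "أنا أحب رياضتي رياض رياضة".match(tokenPattern) || [];

// A keyword counts as found only if it equals a whole token.
const hasWholeWord = tokens.includes("رياضة");
const hasPartialWord = tokens.includes("ريا");

console.log(tokens.length, hasWholeWord, hasPartialWord);
```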

Salma Gomaa answered Oct 19 '22 17:10