
JavaScript: remove Arabic text diacritics dynamically

How can I remove Arabic diacritics dynamically? I'm building an e-book in CHM format made up of several HTML pages that contain Arabic text. Sometimes the search engine won't highlight certain Arabic words because of their diacritics. Is it possible to run a JavaScript function on page load that strips the diacritics from the Arabic text? There must be an option to turn them back on, so I don't want to remove them from the HTML permanently, only temporarily.

The thing is, I don't know where to start or which function is the right one to use.

Thank you :)

For Example

Text : الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ
converted to : الحمد لله رب العالمين 
Jomart Mirza asked Mar 07 '11 19:03


5 Answers

I wrote this function which handles strings with mixed Arabic and English characters, removing special characters (including diacritics) and normalizing some Arabic characters like converting all ة's into ه's.

var normalize_text = function(text) {

  // remove special characters (anything that is not an Arabic letter,
  // an Arabic-Indic digit, an ASCII letter or digit, or a space)
  text = text.replace(/([^\u0621-\u063A\u0641-\u064A\u0660-\u0669a-zA-Z 0-9])/g, '');

  // normalize Arabic
  text = text.replace(/(آ|إ|أ)/g, 'ا');
  text = text.replace(/(ة)/g, 'ه');
  text = text.replace(/(ئ|ؤ)/g, 'ء');
  text = text.replace(/(ى)/g, 'ي');

  // convert Arabic-Indic digits to their ASCII counterparts
  var starter = 0x660;
  for (var i = 0; i < 10; i++) {
    // strings are immutable, so assign the result back;
    // split/join replaces every occurrence, not just the first
    text = text.split(String.fromCharCode(starter + i)).join(String.fromCharCode(48 + i));
  }

  return text;
};
<input value="الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ" type="text" id="input">
<button onclick="document.getElementById('input').value = normalize_text(document.getElementById('input').value)">Normalize</button>
Rashad Saleh answered Nov 06 '22 00:11


I tried the following solution and it works fine:

const str = 'الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ';
const withoutDiacs = str.replace(/([^\u0621-\u063A\u0641-\u064A\u0660-\u0669a-zA-Z 0-9])/g, '');
console.log(withoutDiacs); //الحمد لله رب العالمين
Reference: https://www.overdoe.com/javascript/2020/06/18/arabic-diacritics.html
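
One thing to keep in mind about the pattern above: it is a whitelist, so it removes everything except Arabic letters, Arabic-Indic digits, ASCII letters and digits, and spaces. That includes punctuation such as the Arabic comma. Here is a minimal sketch of the difference. The names stripByWhitelist and stripMarksOnly are hypothetical, and the mark range U+064B to U+0652 (fathatan through sukun) is only an assumption about which diacritics matter for your text:

```javascript
// Whitelist approach from the answer: keep only the listed ranges.
function stripByWhitelist(str) {
  return str.replace(/[^\u0621-\u063A\u0641-\u064A\u0660-\u0669a-zA-Z 0-9]/g, '');
}

// Alternative (assumption): remove only the common harakat,
// U+064B (fathatan) through U+0652 (sukun), and leave everything else.
function stripMarksOnly(str) {
  return str.replace(/[\u064B-\u0652]/g, '');
}

// "al-hamdu" with diacritics, followed by an Arabic comma and ASCII text.
var sample = '\u0627\u0644\u0652\u062D\u064E\u0645\u0652\u062F\u064F\u060C ok!';

console.log(stripByWhitelist(sample)); // diacritics and punctuation removed
console.log(stripMarksOnly(sample));   // only the diacritics removed
```

If punctuation has to survive (for display rather than for a search index), the second form is the safer starting point.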
Ahmed Ismail answered Sep 19 '22 13:09


Try this

Text : الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ
converted to : الحمد لله رب العالمين 

http://www.suhailkaleem.com/2009/08/26/remove-diacritics-from-arabic-text-quran/

The code there is C#, not JavaScript, though. I'm still trying to figure out how to achieve this in JavaScript.

EDIT: Apparently it's very easy in JavaScript. The diacritics are stored as separate "letters" (their own code points), so they can be removed quite easily.

var CHARCODE_SHADDA = 1617;
var CHARCODE_SUKOON = 1618;
var CHARCODE_SUPERSCRIPT_ALIF = 1648;
var CHARCODE_TATWEEL = 1600;
var CHARCODE_ALIF = 1575;

function isCharTashkeel(letter)
{
    if (typeof(letter) == "undefined" || letter == null)
        return false;

    var code = letter.charCodeAt(0);
    //1648 - superscript alif
    //1619 - madd: ~
    //1611 - fathatan, the first mark in the tashkeel range
    return (code == CHARCODE_TATWEEL || code == CHARCODE_SUPERSCRIPT_ALIF || (code >= 1611 && code <= 1631)); //tashkeel
}

function stripTashkeel(input)
{
    var output = "";
    //todo consider using a stringbuilder to improve performance
    for (var i = 0; i < input.length; i++)
    {
        var letter = input.charAt(i);
        if (!isCharTashkeel(letter)) //tashkeel
            output += letter;
    }

    return output;
}
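
The same character test can also be written as a single regular expression. This is a sketch, not the code from the linked C# article: TASHKEEL_PATTERN and stripTashkeelRegex are hypothetical names, and the character class assumes the marks to drop are tatweel (U+0640), superscript alif (U+0670), and the range U+064B to U+065F.

```javascript
// Assumed set of marks: tatweel, the harakat range, and superscript alif.
var TASHKEEL_PATTERN = /[\u0640\u064B-\u065F\u0670]/g;

function stripTashkeelRegex(input) {
  // Mirror the defensive check in the loop version: non-strings yield ''.
  if (typeof input !== 'string') return '';
  return input.replace(TASHKEEL_PATTERN, '');
}

// ra + fatha, ba + shadda + kasra
var withMarks = '\u0631\u064E\u0628\u0651\u0650';
console.log(stripTashkeelRegex(withMarks)); // the two base letters only
```

A regex replace avoids building the output one character at a time, which is what the todo about performance in the loop version is concerned with.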

Edit: Here is another way to do it, using BuckData (Arabic text in Buckwalter transliteration): http://qurandev.github.com/

Advantages:
- Buck uses less bandwidth.
- In JavaScript, you can search through the entire Buck Quran text in one shot. It is intuitive compared to Arabic search.
- Buck to Arabic and Arabic to Buck is a simple JS call. Play with a live sample here: http://jsfiddle.net/BrxJP/
- You can strip out all vowels from Buck text in a few milliseconds. Why do this? You can search in JavaScript while ignoring the tashkeel differences (Fathah, Dammah, Kasrah), which leads to more hits.
- Regex plus Buck text can lead to awesome optimizations. All the searches can be run locally. http://qurandev.appspot.com
- How is the data generated? It is just a one-to-one mapping using: http://corpus.quran.com/java/buckwalter.jsp
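
To illustrate the vowel-stripping point, here is a rough sketch. It assumes the standard Buckwalter scheme, in which the short vowels and related marks are written as the ASCII characters a, u, i, o, ~, F, N, K and the backquote. The function name stripBuckVowels is hypothetical, and the actual BuckData files may use an extended variant of the scheme, so check the mapping before relying on it.

```javascript
// Standard Buckwalter diacritics (assumed): a=fatha, u=damma, i=kasra,
// o=sukun, ~=shadda, F/N/K=tanween forms, `=dagger alif.
var BUCK_DIACRITICS = /[auio~FNK`]/g;

function stripBuckVowels(buckText) {
  return buckText.replace(BUCK_DIACRITICS, '');
}

// Buckwalter rendering of the example verse from the question.
var verse = 'AloHamodu lil~ahi rab~i AloEaAlamiyna';
console.log(stripBuckVowels(verse)); // AlHmd llh rb AlEAlmyn
```

Because every diacritic is a plain ASCII character in this scheme, the same pattern can be applied to both the search term and the indexed text before comparing them.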

Sameer Alibhai answered Nov 06 '22 01:11


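Here's some JavaScript code that removes Arabic diacritics in nearly all cases. It also folds several letter variants into a single form; note that the map below sends ك and ي to the Persian forms ک and ی.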

var arabicNormChar = {
    'ك': 'ک', 'ﻷ': 'لا', 'ؤ': 'و', 'ى': 'ی', 'ي': 'ی', 'ئ': 'ی', 'أ': 'ا', 'إ': 'ا', 'آ': 'ا', 'ٱ': 'ا', 'ٳ': 'ا', 'ة': 'ه', 'ء': '', 'ِ': '', 'ْ': '', 'ُ': '', 'َ': '', 'ّ': '', 'ٍ': '', 'ً': '', 'ٌ': '', 'ٓ': '', 'ٰ': '', 'ٔ': '', '�': ''
}

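var simplifyArabic = function (str) {
    // look up every character outside the range U+0000 to U+007E in the map;
    // characters without an entry are kept unchanged
    return str.replace(/[^\u0000-\u007E]/g, function (a) {
        var retval = arabicNormChar[a];
        if (retval === undefined) { retval = a; }
        return retval;
    }).normalize('NFKD').toLowerCase();
};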

//now you can use simplifyArabic(str) on Arabic strings to remove the diacritics

Note: you can edit arabicNormChar to match your own preferences.

Sina Mansour L. answered Nov 06 '22 01:11


Use this regex to match all tashkeel:

[\u0610-\u061A\u064B-\u065F]
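
Since the question asks for a reversible change, one option is to keep the original strings in memory and swap them in and out of the page's text nodes. The sketch below assumes the tashkeel ranges U+0610 to U+061A and U+064B to U+065F. The helpers collectTextNodes and createTashkeelToggle are hypothetical names, not part of any library, and the code relies on document.createTreeWalker and ES5 array methods, which the browser engine embedded in a CHM viewer may or may not provide.

```javascript
// Assumed tashkeel ranges, written with escapes so the source stays readable.
var TASHKEEL = /[\u0610-\u061A\u064B-\u065F]/g;

// Collect every text node under `root` (browser only).
function collectTextNodes(root) {
  var walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
  var nodes = [];
  while (walker.nextNode()) {
    nodes.push(walker.currentNode);
  }
  return nodes;
}

// Remember the original text of each node so it can be restored later.
function createTashkeelToggle(textNodes) {
  var originals = textNodes.map(function (node) {
    return node.nodeValue;
  });
  return {
    hide: function () {
      textNodes.forEach(function (node, i) {
        node.nodeValue = originals[i].replace(TASHKEEL, '');
      });
    },
    show: function () {
      textNodes.forEach(function (node, i) {
        node.nodeValue = originals[i];
      });
    }
  };
}

// Usage in a page (not executed here):
// var toggle = createTashkeelToggle(collectTextNodes(document.body));
// toggle.hide();  // strip diacritics, e.g. on page load
// toggle.show();  // bring them back
```

The HTML source is never modified; only the live text nodes change, and the stored originals make the operation reversible.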

Yusuf answered Nov 06 '22 01:11