How to remove Arabic diacritics dynamically
I'm building an ebook in CHM format with multiple HTML pages that contain Arabic text. Sometimes the search engine won't highlight some of the Arabic words because of their diacritics. Is it possible to use JavaScript functions when the page loads to strip the diacritics from the Arabic text? There must be an option to enable them again, so I don't want to remove them from the HTML permanently, only temporarily.
The thing is, I don't know where to start or which function is the right one to use.
Thank you :)
For example:
Text : الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ
converted to : الحمد لله رب العالمين
I wrote this function which handles strings with mixed Arabic and English characters, removing special characters (including diacritics) and normalizing some Arabic characters like converting all ة's into ه's.
normalize_text = function(text) {
//remove special characters
text = text.replace(/([^\u0621-\u063A\u0641-\u064A\u0660-\u0669a-zA-Z 0-9])/g, '');
//normalize Arabic
text = text.replace(/(آ|إ|أ)/g, 'ا');
text = text.replace(/(ة)/g, 'ه');
text = text.replace(/(ئ|ؤ)/g, 'ء');
text = text.replace(/(ى)/g, 'ي');
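//convert Arabic-Indic digits (U+0660-U+0669) to their ASCII counterparts.
var starter = 0x660;
for (var i = 0; i < 10; i++) {
//replace() returns a new string, so assign it back; the 'g' flag replaces every occurrence
text = text.replace(new RegExp(String.fromCharCode(starter + i), 'g'), String.fromCharCode(48 + i));
}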
return text;
}
<input value="الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ" type="text" id="input">
<button onclick="document.getElementById('input').value = normalize_text(document.getElementById('input').value)">Normalize</button>
I tried the following solution and it works fine:
const str = 'الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ';
const withoutDiacs = str.replace(/([^\u0621-\u063A\u0641-\u064A\u0660-\u0669a-zA-Z 0-9])/g, '');
console.log(withoutDiacs); //الحمد لله رب العالمين
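One thing to keep in mind with the whitelist pattern above: it drops every character outside the listed ranges, so punctuation is removed along with the diacritics. Here is a minimal sketch of a narrower alternative, assuming you only want to drop the combining marks and the tatweel. The name `stripDiacritics` and the exact ranges are my own choices, not part of the answer above.

```javascript
// Hypothetical helper: removes only Arabic combining marks and tatweel,
// leaving letters, digits, spaces and punctuation untouched.
// Assumed ranges: U+0610-U+061A (Quranic signs), U+0640 (tatweel),
// U+064B-U+065F (harakat), U+0670 (superscript alef).
var ARABIC_MARKS = /[\u0610-\u061A\u0640\u064B-\u065F\u0670]/g;

function stripDiacritics(text) {
  return text.replace(ARABIC_MARKS, '');
}

console.log(stripDiacritics('الْحَمْدُ لِلَّهِ، رَبِّ الْعَالَمِينَ.'));
```

Unlike the whitelist, this keeps the comma and the full stop in the sample string.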
Try this
Text : الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ
converted to : الحمد لله رب العالمين
http://www.suhailkaleem.com/2009/08/26/remove-diacritics-from-arabic-text-quran/
The code is C#, not JavaScript, though. I'm still trying to figure out how to achieve this in JavaScript.
EDIT: Apparently it's very easy in JavaScript. The diacritics are stored as separate "letters" and they can be removed quite easily.
var CHARCODE_SHADDA = 1617;
var CHARCODE_SUKOON = 1618;
var CHARCODE_SUPERSCRIPT_ALIF = 1648;
var CHARCODE_TATWEEL = 1600;
var CHARCODE_ALIF = 1575;
function isCharTashkeel(letter)
{
if (typeof(letter) == "undefined" || letter == null)
return false;
var code = letter.charCodeAt(0);
//1648 - superscript alif
//1619 - madd: ~
//1611-1631 (U+064B-U+065F) - tashkeel, starting at fathatan (1611)
return (code == CHARCODE_TATWEEL || code == CHARCODE_SUPERSCRIPT_ALIF || code >= 1611 && code <= 1631); //tashkeel
}
function stripTashkeel(input)
{
var output = "";
//todo consider using a stringbuilder to improve performance
for (var i = 0; i < input.length; i++)
{
var letter = input.charAt(i);
if (!isCharTashkeel(letter)) //tashkeel
output += letter;
}
return output;
}
Edit: Here is another way to do it, using BuckData: http://qurandev.github.com/
Advantages:
- Buck uses less bandwidth.
- In JavaScript, you can search through the entire Buck Quran text in one shot.
- It is intuitive compared to Arabic search.
- Converting Buck to Arabic and Arabic to Buck is a simple JS call. Play with a live sample here: http://jsfiddle.net/BrxJP/
- You can strip all vowels from Buck text in a few milliseconds. Why do this? You can search in JavaScript while ignoring the tashkeel differences (Fathah, Dammah, Kasrah), which leads to more hits.
- Regex plus Buck text can lead to awesome optimizations.
- All the searches can be run locally: http://qurandev.appspot.com
How is the data generated? It is just a one-to-one mapping using http://corpus.quran.com/java/buckwalter.jsp
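The same idea of ignoring tashkeel during a search can also be tried directly on Arabic text, without transliterating first. Here is a rough sketch, assuming the search term is typed without diacritics: build a regular expression that allows any run of marks after each character of the query, so a plain query still matches the vowelled text and the match can be highlighted in place. `buildLooseMatcher` and the set of marks are illustrative assumptions of mine, not part of BuckData.

```javascript
// Marks that may follow a letter in vowelled text (an assumed set:
// harakat U+064B-U+065F, superscript alef U+0670, tatweel U+0640).
var MARKS = '[\\u064B-\\u065F\\u0670\\u0640]*';

// Hypothetical helper: turns a plain query into a RegExp that tolerates
// diacritics after every character of the query.
function buildLooseMatcher(query) {
  var parts = [];
  for (var i = 0; i < query.length; i++) {
    // Escape regex metacharacters so the query is matched literally.
    var ch = query.charAt(i).replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    parts.push(ch + MARKS);
  }
  return new RegExp(parts.join(''), 'g');
}

var text = 'الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ';
var hits = text.match(buildLooseMatcher('الحمد'));
console.log(hits ? hits.length : 0);
```

Because the original text is never modified, nothing has to be restored afterwards. The trade-off is that the search code itself has to use the loose pattern, which may not be possible with a built-in search engine.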
Here's some JavaScript code that removes Arabic diacritics in nearly all cases.
var arabicNormChar = {
'ك': 'ک', 'ﻷ': 'لا', 'ؤ': 'و', 'ى': 'ی', 'ي': 'ی', 'ئ': 'ی', 'أ': 'ا', 'إ': 'ا', 'آ': 'ا', 'ٱ': 'ا', 'ٳ': 'ا', 'ة': 'ه', 'ء': '', 'ِ': '', 'ْ': '', 'ُ': '', 'َ': '', 'ّ': '', 'ٍ': '', 'ً': '', 'ٌ': '', 'ٓ': '', 'ٰ': '', 'ٔ': '', '�': ''
}
var simplifyArabic = function (str) {
return str.replace(/[^\u0000-\u007E]/g, function(a){
var retval = arabicNormChar[a]
if (retval == undefined) {retval = a}
return retval;
}).normalize('NFKD').toLowerCase();
}
//now you can use simplifyArabic(str) on Arabic strings to remove the diacritics
Note: you may override the arabicNormChar to your own preferences.
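Use this regex to catch all tashkeel (written with \u escapes so that the combining marks are not reordered or lost when the pattern is copied):
[\u0610-\u061A\u064B-\u065F]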
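None of the snippets so far covers the "temporary" part of the question. Here is one possible approach, sketched on the assumption that the page's text lives in ordinary text nodes: walk the DOM, remember each text node's original value, and write the stripped value in its place. Restoring then just writes the saved values back. The function names, the `saved` array and the mark range are hypothetical choices of mine. The sketch sticks to `childNodes` and `nodeType` rather than newer APIs because CHM viewers generally embed an older Internet Explorer engine, though I have not tested it there.

```javascript
// Assumed set of marks to hide: U+0610-U+061A, U+064B-U+065F, U+0670.
var TASHKEEL = /[\u0610-\u061A\u064B-\u065F\u0670]/g;
var saved = []; // { node, original } for every text node that was changed

// Recursively visit every text node (nodeType 3) under root,
// skipping the contents of <script> and <style> elements.
function eachTextNode(root, visit) {
  if (root.nodeType === 3) {
    visit(root);
    return;
  }
  if (root.nodeName === 'SCRIPT' || root.nodeName === 'STYLE') return;
  var children = root.childNodes || [];
  for (var i = 0; i < children.length; i++) {
    eachTextNode(children[i], visit);
  }
}

// Strip the marks in place, keeping the original text so it can be restored.
function hideDiacritics(root) {
  if (saved.length) return; // already hidden
  eachTextNode(root, function (node) {
    var stripped = node.nodeValue.replace(TASHKEEL, '');
    if (stripped !== node.nodeValue) {
      saved.push({ node: node, original: node.nodeValue });
      node.nodeValue = stripped;
    }
  });
}

// Put the original text back into every node that was changed.
function showDiacritics() {
  for (var i = 0; i < saved.length; i++) {
    saved[i].node.nodeValue = saved[i].original;
  }
  saved = [];
}

// In a page: window.onload = function () { hideDiacritics(document.body); };
```

A button that calls `showDiacritics()` and another that calls `hideDiacritics(document.body)` would give the reader the on/off switch. The HTML source is never changed, only the live text nodes.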