
Arabic text zero width joiners not working between elements

I am trying to implement a "Smart Search" feature which highlights text matches in a div as a user types a keyword. The highlighting works by using a regular expression to match the keyword in the div and replace it with

<span class="highlight">keyword</span>

The application supports both English and Arabic text. English works fine, but when a match inside an Arabic word is highlighted, the span breaks the cursive connection between the letters, so the word no longer renders as a single continuous word.

I'm trying to fix the issue by using three separate regular expressions and adding zero width joiners as appropriate for each case:

  • Match at the Beginning of a word

    var startsWithRegex = new RegExp("((^|\\s)" + keyword + ")", "gi");

    var newSpan = "<span class='highlight'>$1&zwj;</span>&zwj;";

  • Match in the Middle of a word (Note: There can be multiple middleOf matches in a single word)

    var middleOfRegex = new RegExp("([^(^|\\s)])(" + keyword + ")([^($|\\s)])", "gi");

    var newSpan = "&zwj;$1&zwj;<span class='highlight'>&zwj;$2&zwj;</span>&zwj;$3&zwj;";

  • Match at the End of a word

    var endsWithRegex = new RegExp("(" + keyword + "($|\\s))", "gi");

    var newSpan = "&zwj;<span class='highlight'>&zwj;$1</span>";
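
One detail that may be worth checking in the middle-of-word case: inside square brackets, `(`, `|` and `)` are literal characters (as is the second `^`), so `[^(^|\\s)]` also excludes those symbols rather than expressing "not the start of the string and not whitespace". Below is a minimal sketch of an alternative, assuming `\S` is an acceptable definition of "still inside the same word". The helper names `escapeRegExp` and `highlightMiddle` are hypothetical, and this only changes which text is matched; it does not change how WebKit shapes letters across the span.

```javascript
// Hypothetical helper: escape regex metacharacters so the keyword is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: wrap a keyword that has a non-whitespace character on both sides.
// The lookahead (?=\S) checks the following character without consuming it,
// so it does not need to be re-emitted as $3.
function highlightMiddle(text, keyword) {
  var ZWJ = "\u200D"; // same character as the &zwj; entity
  var re = new RegExp("(\\S)(" + escapeRegExp(keyword) + ")(?=\\S)", "gi");
  return text.replace(
    re,
    "$1" + ZWJ + "<span class='highlight'>" + ZWJ + "$2" + ZWJ + "</span>" + ZWJ
  );
}
```

Because the preceding character is still consumed as `$1`, two keyword matches that sit directly next to each other in one word would need an extra pass.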

Both startsWithRegex and endsWithRegex appear to work as expected, but middleOfRegex does not. For example:

للأبد

transforms into:

ل‍‍ل‍‍أ‍بد

when the keyword is:

ل

I've tried various other combinations of &zwj; but nothing seems to work. Is this a limitation of WebKit? Is there another implementation I can use to get my desired result?

Thanks!



A few extra notes:

  • This only happens in WebKit-based browsers (Chrome specifically in my case), and we cannot use an alternative browser. I believe this bug is the root cause of the issue: https://bugs.webkit.org/show_bug.cgi?id=6148
  • This question is an extension of these two Stack Overflow questions:

    Inserting HTML tag in the middle of Arabic word breaks word connection (cursive)

    Partially colored Arabic word in HTML

Drew MacLaren asked Jan 04 '16 18:01


1 Answer

Arabic is a special case because each letter takes a different form depending on its position in the word. I remember solving a similar problem by using the letters' Unicode values: each form of a letter has its own Unicode code point. You can find the Unicode table here:

https://en.wikipedia.org/wiki/Arabic_script_in_Unicode

You can get the Unicode value of a character using

var code = $(selector).text().charCodeAt(0);
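
A minimal sketch of that idea, assuming the positional forms are taken from the Arabic Presentation Forms-B block. The `FORMS` table and the `toPresentationForms` function are hypothetical names, the table covers only the five letters of the example word, and ligatures such as lam-alef are ignored. Each letter is replaced by the code point for its isolated, final, initial or medial form, so the shape no longer depends on the browser joining letters across a span boundary.

```javascript
// Hypothetical lookup table: base code point -> [isolated, final, initial, medial].
// Right-joining letters (alef, dal, ...) only have isolated and final forms.
var FORMS = {
  0x0623: [0xFE83, 0xFE84],                 // ALEF WITH HAMZA ABOVE
  0x0627: [0xFE8D, 0xFE8E],                 // ALEF
  0x0628: [0xFE8F, 0xFE90, 0xFE91, 0xFE92], // BEH
  0x062F: [0xFEA9, 0xFEAA],                 // DAL
  0x0644: [0xFEDD, 0xFEDE, 0xFEDF, 0xFEE0]  // LAM
};

function toPresentationForms(word) {
  var out = "";
  for (var i = 0; i < word.length; i++) {
    var forms = FORMS[word.charCodeAt(i)];
    if (!forms) {            // unknown character: copy it through unchanged
      out += word.charAt(i);
      continue;
    }
    var prev = i > 0 ? FORMS[word.charCodeAt(i - 1)] : undefined;
    var next = i + 1 < word.length ? FORMS[word.charCodeAt(i + 1)] : undefined;
    // The previous letter connects to this one only if it is dual-joining (4 forms).
    var joinsPrev = !!prev && prev.length === 4;
    // This letter connects forward only if it is dual-joining and a known letter follows.
    var joinsNext = forms.length === 4 && !!next;
    var index = joinsPrev && joinsNext ? 3 : joinsPrev ? 1 : joinsNext ? 2 : 0;
    out += String.fromCharCode(forms[index]);
  }
  return out;
}
```

The highlighted letters can then be wrapped in the span individually, since each one already carries its positional shape. Text converted this way no longer matches a plain-text search for the original letters, so it is probably best applied only to the rendered copy.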

Majali answered Sep 28 '22 03:09