
Remove Unicode Zero Width Space PHP

I have some UTF-8 text in Burmese that I am processing with PHP. At some point along the way, some zero-width spaces (ZWSPs) crept in, and I would like to remove them. I have tried two different ways of removing the characters, and neither seems to work.

First I tried:

  $newBody = str_replace("&#8203;", "", $newBody);

to search for the HTML entity and remove it, since that is how it appears in Web Inspector. The spaces don't get removed. I have also tried it as:

  $newBody = str_replace("&#8203", "", $newBody);

and got the same result: nothing was removed.

The second method I tried came from the question "Remove ZERO WIDTH NON-JOINER character from a string in PHP", which looked like this:

 $newBody = str_replace("\xE2\x80\x8C", "", $newBody);

but I also got no result. The ZWSP was not removed.

An example word in the text ($newBody) looks like this : ယူ​​က​​ရိန်
And I want to make it look like this : ယူကရိန်း

Any ideas? Would a preg_replace work better somehow?

So I did try

$newBody = preg_replace("/\xE2\x80\x8B/", "", $newBody);

and it appears to be working, but now there is another issue.

<a class="defined" title="Ukraine">ယူ&#8203;က&#8203;ရိန်း</a>

gets transformed into

<a class="defined _tt_t_" title="Ukraine" style="font-family: 'Masterpiece Uni Sans', TharLon, Myanmar3, Yunghkio, Padauk, Parabaik, 'WinUni Innwa', 'Win Uni Innwa', 'MyMyanmar Unicode', Panglong, 'Myanmar Sangam MN', 'Myanmar MN';">ယူကရိန်း</a>

I don't want it to add all that extra stuff. Any ideas why this is happening? Apart from coming up with some way to target only the text in between the tags, is there another way to prevent the preg_replace from adding all this extra stuff? By the way, I am using Google Chrome on a Mac. It seems to act a bit differently in Firefox...

Jimmy Long asked Mar 24 '14 02:03





1 Answer

This:

$newBody = str_replace("&#8203;", "", $newBody);

presumes the text is HTML entity encoded. This:

$newBody = str_replace("\xE2\x80\x8C", "", $newBody);

should work if the offending characters are not entity-encoded, but it matches the wrong character: 0xE2808C is the UTF-8 encoding of U+200C (ZERO WIDTH NON-JOINER). To match the same character as &#8203; (U+200B, ZERO WIDTH SPACE) you need 0xE2808B:

$newBody = str_replace("\xE2\x80\x8B", "", $newBody);
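
Since the text may carry the ZWSP in either form (entity-encoded or as raw UTF-8 bytes), a minimal sketch that handles both could look like the following. The helper name strip_zwsp is hypothetical, and covering the hexadecimal entity &#x200B; is an assumption that goes beyond what the question shows.

```php
<?php
// Hypothetical helper (not part of the original answer): removes U+200B
// whether it arrives as an HTML numeric entity or as raw UTF-8 bytes.
function strip_zwsp(string $text): string
{
    // Numeric entity forms of U+200B: decimal &#8203; and hex &#x200B;.
    // str_ireplace is used so that &#x200b; and &#X200B; also match.
    $text = str_ireplace(['&#8203;', '&#x200B;'], '', $text);

    // Raw UTF-8 encoding of U+200B is the byte sequence E2 80 8B.
    return str_replace("\xE2\x80\x8B", '', $text);
}

// Sample input containing one raw ZWSP and one entity-encoded ZWSP.
$dirty = "ယူ\xE2\x80\x8Bက&#8203;ရိန်း";
$clean = strip_zwsp($dirty);

// strpos returns false when the raw ZWSP bytes no longer occur.
var_dump(strpos($clean, "\xE2\x80\x8B"));
```

If you prefer preg_replace, the pattern '/\x{200B}/u' should match the same raw character; the u modifier makes PCRE treat both pattern and subject as UTF-8.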
Jef answered Sep 24 '22 09:09
