 

How to remove \u200B (Zero Width Space Unicode character) from a String in Java?

My application uses Spring Integration to poll email from an Outlook mailbox.

Because it receives the String (the email body) from an external system (Outlook), I have no control over its content.

For example:

String emailBodyStr = "rejected by sundar14-\u200B.";

Now I am trying to remove the unicode character \u200B from this String.

What I have tried so far:

Try#1:

emailBodyStr = emailBodyStr.replaceAll("\u200B", "");

Try#2:

emailBodyStr = emailBodyStr.replaceAll("\u200B", "").trim();

Try#3 (using Apache Commons):

StringEscapeUtils.unescapeJava(emailBodyStr);

Try#4:

StringEscapeUtils.unescapeJava(emailBodyStr).trim();

Nothing has worked so far.

I print the String using the code below:

logger.info("Comment BEFORE:{}",emailBodyStr);
logger.info("Comment AFTER :{}",emailBodyStr);

In the Eclipse console, the Unicode character is NOT printed:

Comment BEFORE:rejected by sundar14-​.

But the same code prints the Unicode character in the Linux console, as below:

Comment BEFORE:rejected by sundar14-\u200B.
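
Since the two consoles display the same log line differently, console output alone may not show what the String really contains. A minimal sketch of one way to check, assuming a hypothetical helper class named CodePointDump (not part of the original application): it rewrites every non-printable or non-ASCII character as a visible backslash-u escape, so the result looks the same on any console.

```java
// Hypothetical diagnostic helper; the class and method names are illustrative only.
final class CodePointDump {

    // Returns the text with each control or non-ASCII char written as a visible escape.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x20 || c > 0x7E) {
                // Outside printable ASCII: emit a backslash, 'u', and four hex digits.
                out.append(String.format("\\u%04X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample value taken from the question.
        String emailBodyStr = "rejected by sundar14-\u200B.";
        System.out.println(escapeNonAscii(emailBodyStr));
    }
}
```

If the dump of the real email body shows a single escaped character, the String holds an actual U+200B. If the real String instead contains a literal backslash followed by u200B, that is six separate ASCII characters and needs different handling.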

I have read some examples that recommend str.replace(), but note that those examples use JavaScript and PHP, not Java.

asked Mar 22 '17 18:03 by Sundararaj Govindasamy




1 Answer

Finally, I was able to remove the zero width space character by using a Unicode regex.

// \\p{Cf} matches every character in the Unicode "format" category, which includes U+200B.
String plainEmailBody = emailBodyStr.replaceAll("\\p{Cf}", "");
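
A self-contained sketch of the same idea, assuming a hypothetical class named ZeroWidthCleaner. Note that the Cf category is broader than U+200B: it also covers characters such as the soft hyphen (U+00AD), zero width non-joiner (U+200C), zero width joiner (U+200D), and the byte order mark (U+FEFF), so this regex strips all of them. The second method is a narrower alternative that targets U+200B only; the question reports that a similar call did not help in that application, so treat it as something to verify against the real input.

```java
// Hypothetical helper; the class and method names are illustrative only.
final class ZeroWidthCleaner {

    // Removes every character in Unicode general category Cf (format), U+200B included.
    static String stripFormatChars(String text) {
        return text.replaceAll("\\p{Cf}", "");
    }

    // Narrower alternative: removes only the zero width space, using a literal (non-regex) replace.
    static String stripZeroWidthSpace(String text) {
        return text.replace("\u200B", "");
    }

    public static void main(String[] args) {
        // Sample value taken from the question.
        String emailBodyStr = "rejected by sundar14-\u200B.";
        String plainEmailBody = stripFormatChars(emailBodyStr);
        System.out.println(emailBodyStr.length() + " -> " + plainEmailBody.length());
    }
}
```

Comparing the length before and after is a quick check that does not depend on how the console renders invisible characters.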

References for finding the category of a Unicode character:

  1. The Character class in Java, which lists all of these Unicode categories.

  2. Website: http://www.fileformat.info/ (character category)

  3. Website: http://www.regular-expressions.info/ => Unicode Regular Expressions (Unicode regex for the \u200B character)
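
The category lookup can also be done directly in code. A minimal sketch, assuming a hypothetical class named CategoryLookup: Character.getType returns the general category of a code point, and Character.FORMAT is the constant that corresponds to Cf.

```java
// Hypothetical helper; the class and method names are illustrative only.
final class CategoryLookup {

    // True when the code point belongs to Unicode general category Cf (format).
    static boolean isFormatChar(int codePoint) {
        return Character.getType(codePoint) == Character.FORMAT;
    }

    public static void main(String[] args) {
        System.out.println("U+200B is Cf: " + isFormatChar(0x200B));
        System.out.println("U+0020 is Cf: " + isFormatChar(0x0020));
    }
}
```

A check like this shows whether a given character would be matched by the \p{Cf} regex before relying on it.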

Note 1: Because this string came from an Outlook email body, none of the approaches listed in my question worked.

My application receives the String from an external system (Outlook), so I have no control over it.

Note 2: This SO answer helped me learn about Unicode regular expressions.

answered Nov 15 '22 19:11 by Sundararaj Govindasamy