
Remove non-ASCII non-printable characters from a String

I get user input including non-ASCII characters and non-printable characters, such as

\xc2d
\xa0
\xe7
\xc3\ufffdd
\xc3\ufffdd
\xc2\xa0
\xc3\xa7
\xa0\xa0

for example:

email : [email protected]\xa0\xa0
street : 123 Main St.\xc2\xa0

desired output:

  email : [email protected]
  street : 123 Main St.

What is the best way to remove them using Java?
I tried the following, but it doesn't seem to work:

public static void main(String args[]) throws UnsupportedEncodingException {
    String s = "abc@gmail\\xe9.com";
    String email = "[email protected]\\xa0\\xa0";

    System.out.println(s.replaceAll("\\P{Print}", ""));
    System.out.println(email.replaceAll("\\P{Print}", ""));
}

Output

abc@gmail\xe9.com
[email protected]\xa0\xa0
daydreamer asked Jun 13 '12 18:06



1 Answer

Your requirements are not clear. All characters in a Java String are Unicode characters, so if you remove them, you'll be left with an empty string. I assume what you mean is that you want to remove any non-ASCII, non-printable characters.

String clean = str.replaceAll("\\P{Print}", "");

Here, \p{Print} represents a POSIX character class for printable ASCII characters, while \P{Print} is the complement of that class. With this expression, all characters that are not printable ASCII are replaced with the empty string. (The extra backslash is because \ starts an escape sequence in string literals.)
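To see the expression work, the test strings have to contain the real characters rather than a backslash followed by "x" and two digits. Here is a minimal sketch; the class name PrintableFilter and the helper stripNonPrintable are made up for illustration, and the sample values are modeled on the ones in the question.

```java
public class PrintableFilter {

    // Hypothetical helper: drops every character that is not printable ASCII.
    public static String stripNonPrintable(String input) {
        return input.replaceAll("\\P{Print}", "");
    }

    public static void main(String[] args) {
        // These literals hold real characters: U+00A0 (no-break space)
        // and U+00E9 (e with acute accent), written as Unicode escapes.
        String street = "123 Main St.\u00a0";
        String email = "abc@gmail\u00e9.com\u00a0\u00a0";

        // Brackets make any leftover trailing whitespace visible.
        System.out.println("[" + stripNonPrintable(street) + "]"); // [123 Main St.]
        System.out.println("[" + stripNonPrintable(email) + "]");  // [abc@gmail.com]
    }
}
```

Ordinary spaces survive because \p{Print} includes the space character, while tabs, newlines and anything outside ASCII are removed.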


Apparently, all the input characters are actually ASCII characters that represent a printable encoding of non-printable or non-ASCII characters. Mongo shouldn't have any trouble with these strings, because they contain only plain printable ASCII characters.

This all sounds a little fishy to me. What I believe is happening is that the data really do contain non-printable and non-ASCII characters, and another component (like a logging framework) is replacing these with a printable representation. In your simple tests, you are failing to translate the printable representation back to the original string, so you mistakenly believe the first regular expression is not working.
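If you want to test against the printable representation, one option is to translate it back first. The sketch below is only an illustration under assumptions: the class name EscapeDecoder and the helper decodeHexEscapes are hypothetical, and it assumes each \xHH stands for one byte of UTF-8 encoded text, which I can't confirm from the question.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapeDecoder {

    // A literal backslash, an "x", then two hex digits (captured).
    private static final Pattern HEX_ESCAPE =
            Pattern.compile("\\\\x(\\p{XDigit}{2})");

    // Hypothetical helper: turns each literal \xHH back into the byte it
    // names, then decodes the whole byte sequence as UTF-8 (an assumption).
    public static String decodeHexEscapes(String input) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Matcher m = HEX_ESCAPE.matcher(input);
        int last = 0;
        while (m.find()) {
            // Copy the plain text before this escape unchanged.
            byte[] plain = input.substring(last, m.start())
                    .getBytes(StandardCharsets.UTF_8);
            bytes.write(plain, 0, plain.length);
            // Append the single byte described by the two hex digits.
            bytes.write(Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        byte[] tail = input.substring(last).getBytes(StandardCharsets.UTF_8);
        bytes.write(tail, 0, tail.length);
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Printable representation: plain ASCII, so \P{Print} leaves it alone.
        String logged = "123 Main St.\\xc2\\xa0";
        System.out.println(logged.replaceAll("\\P{Print}", "").equals(logged)); // true

        // Decoded back to the real string, the no-break space is removed.
        String original = decodeHexEscapes(logged);
        System.out.println("[" + original.replaceAll("\\P{Print}", "") + "]"); // [123 Main St.]
    }
}
```

Bytes that do not form valid UTF-8 decode to the replacement character U+FFFD, which \P{Print} also removes.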

That's my guess, but if I've misread the situation and you really do need to strip out literal \xHH escapes, you can do it with the following regular expression.

String clean = str.replaceAll("\\\\x\\p{XDigit}{2}", "");
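As a quick check of that expression against strings like the ones in the question, here is a small sketch. The class name EscapeStripper and the helper stripHexEscapes are hypothetical names chosen for the example.

```java
public class EscapeStripper {

    // Hypothetical helper: removes each literal backslash-x-HH sequence,
    // i.e. four ASCII characters, not the character they describe.
    public static String stripHexEscapes(String input) {
        return input.replaceAll("\\\\x\\p{XDigit}{2}", "");
    }

    public static void main(String[] args) {
        // In Java source, "\\x" is a literal backslash followed by 'x'.
        String street = "123 Main St.\\xc2\\xa0";
        String email = "abc@gmail\\xe9.com";

        System.out.println("[" + stripHexEscapes(street) + "]"); // [123 Main St.]
        System.out.println("[" + stripHexEscapes(email) + "]");  // [abc@gmail.com]
    }
}
```

Note that this only covers the \xHH form. A literal \uHHHH sequence, like the \ufffd in some of your samples, would be left in place and would need its own pattern.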

The API documentation for the Pattern class does a good job of listing all of the syntax supported by Java's regex library. For more elaboration on what all of the syntax means, I have found the Regular-Expressions.info site very helpful.

erickson answered Oct 09 '22 05:10