Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I replace non-printable Unicode characters in Java?

The following will replace ASCII control characters (shorthand for [\x00-\x1F\x7F]):

my_string.replaceAll("\\p{Cntrl}", "?");

The following will replace all ASCII non-printable characters (shorthand for [\p{Graph}\x20]), including accented characters:

my_string.replaceAll("[^\\p{Print}]", "?");

However, neither works for Unicode strings. Does anyone has a good way to remove non-printable characters from a unicode string?

like image 777
dagnelies Avatar asked Jun 01 '11 09:06

dagnelies


People also ask

How do I change Unicode in Java?

In order to convert Unicode to UTF-8 in Java, we use the getBytes() method. The getBytes() method encodes a String into a sequence of bytes and returns a byte array. Declaration - The getBytes() method is declared as follows.


3 Answers

my_string.replaceAll("\\p{C}", "?");

See more about Unicode regex. java.util.regexPattern/String.replaceAll supports them.

like image 111
Op De Cirkel Avatar answered Oct 21 '22 12:10

Op De Cirkel


Op De Cirkel is mostly right. His suggestion will work in most cases:

myString.replaceAll("\\p{C}", "?");

But if myString might contain non-BMP codepoints then it's more complicated. \p{C} contains the surrogate codepoints of \p{Cs}. The replacement method above will corrupt non-BMP codepoints by sometimes replacing only half of the surrogate pair. It's possible this is a Java bug rather than intended behavior.

Using the other constituent categories is an option:

myString.replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}]", "?");

However, solitary surrogate characters not part of a pair (each surrogate character has an assigned codepoint) will not be removed. A non-regex approach is the only way I know to properly handle \p{C}:

StringBuilder newString = new StringBuilder(myString.length());
for (int offset = 0; offset < myString.length();)
{
    int codePoint = myString.codePointAt(offset);
    offset += Character.charCount(codePoint);

    // Replace invisible control characters and unused code points
    switch (Character.getType(codePoint))
    {
        case Character.CONTROL:     // \p{Cc}
        case Character.FORMAT:      // \p{Cf}
        case Character.PRIVATE_USE: // \p{Co}
        case Character.SURROGATE:   // \p{Cs}
        case Character.UNASSIGNED:  // \p{Cn}
            newString.append('?');
            break;
        default:
            newString.append(Character.toChars(codePoint));
            break;
    }
}
like image 70
noackjr Avatar answered Oct 21 '22 13:10

noackjr


methods below for your goal

public static String removeNonAscii(String str)
{
    return str.replaceAll("[^\\x00-\\x7F]", "");
}

public static String removeNonPrintable(String str) // All Control Char
{
    return str.replaceAll("[\\p{C}]", "");
}

public static String removeSomeControlChar(String str) // Some Control Char
{
    return str.replaceAll("[\\p{Cntrl}\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}]", "");
}

public static String removeFullControlChar(String str)
{
    return removeNonPrintable(str).replaceAll("[\\r\\n\\t]", "");
} 
like image 10
Ali Bagheri Avatar answered Oct 21 '22 13:10

Ali Bagheri