How do I match unicode characters in Java

I'm trying to match Unicode characters in Java.

Input String: informa

String to match : informátion

So far I've tried this:

Pattern p = Pattern.compile("informa[\u0000-\uffff].*", Pattern.UNICODE_CASE | Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE);
String s = "informátion";
Matcher m = p.matcher(s);
if (m.matches()) {
    System.out.println("Match!");
} else {
    System.out.println("No match");
}

It comes out as "No match". Any ideas?

ankimal asked Jun 23 '10 16:06



1 Answer

The term "Unicode characters" is not specific enough. Every character a Java string can hold is a Unicode character, so a pattern for "Unicode characters" would also match "normal" characters such as ASCII letters. The term is, however, very often used when one actually means "characters which are not in the printable ASCII range".

In regex terms that would be [^\x20-\x7E].

boolean containsNonPrintableASCIIChars = string.matches(".*[^\\x20-\\x7E].*");
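
As a rough sketch of how that check could be wrapped up and tried out: the class and method names below are made up for illustration, and the sketch swaps matches(".*...*") for Matcher.find() on the bare character class. That is a deliberate change, since "." does not match line terminators by default and find() avoids having to care about that.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class NonAsciiCheck {
    // Any single character outside the printable ASCII range 0x20 to 0x7E.
    private static final Pattern NON_PRINTABLE_ASCII = Pattern.compile("[^\\x20-\\x7E]");

    // Returns true if at least one such character occurs anywhere in s.
    static boolean containsNonPrintableAscii(String s) {
        return NON_PRINTABLE_ASCII.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsNonPrintableAscii("information"));      // false
        System.out.println(containsNonPrintableAscii("inform\u00e1tion")); // true
    }
}
```

Note that control characters such as tabs and line breaks also fall outside 0x20 to 0x7E, so they count as hits here too.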

Depending on what you'd like to do with this information, here are some useful follow-up answers:

  • Get rid of special characters
  • Get rid of diacritical marks
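
For the case in the question (matching the input "informa" against "informátion"), the diacritical-marks route might look something like the sketch below. It assumes that treating "á" as "a" is what is wanted, and the class and method names are invented for illustration. It relies on java.text.Normalizer to decompose each accented letter into a base letter plus combining mark, then drops the marks.

```java
import java.text.Normalizer;

// Hypothetical helper class, not part of the original answer.
class DiacriticsDemo {
    // Decompose to NFD (e.g. U+00E1 becomes 'a' followed by U+0301),
    // then remove everything in the Combining Diacritical Marks block.
    static String stripDiacritics(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        String input = "informa";
        String candidate = "inform\u00e1tion"; // "informátion" with a precomposed á

        System.out.println(stripDiacritics(candidate));                   // information
        System.out.println(stripDiacritics(candidate).startsWith(input)); // true
    }
}
```

After stripping, an ordinary regex or startsWith() comparison works without any special flags. Marks outside the U+0300 to U+036F block are left in place by this sketch.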
BalusC answered Sep 23 '22 12:09