I'm trying to match Unicode characters in Java.
Input String: informa
String to match : informátion
So far I've tried this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("informa[\u0000-\uffff].*",
        Pattern.UNICODE_CASE | Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE);
String s = "informátion";
Matcher m = p.matcher(s);
if (m.matches()) {
    System.out.println("Match!");
} else {
    System.out.println("No match");
}
It comes out as "No match". Any ideas?
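One possible way around this, offered as a sketch rather than a confirmed diagnosis: if the á in the string is the single precomposed character U+00E1, the literal prefix informa never appears in it. Assuming the goal is to ignore accents, both strings can be decomposed to NFD with java.text.Normalizer, stripped of combining marks, and then matched. The class and method names below are made up for illustration.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class AccentInsensitiveMatch {
    // One or more combining marks (Unicode general category M).
    private static final Pattern COMBINING_MARKS = Pattern.compile("\\p{M}+");

    // Decompose to NFD so an accented letter becomes base letter + mark,
    // then drop the marks.
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return COMBINING_MARKS.matcher(decomposed).replaceAll("");
    }

    // True if text starts with prefix once accents and case are ignored.
    static boolean startsWithIgnoringAccents(String text, String prefix) {
        Pattern p = Pattern.compile(
                Pattern.quote(stripAccents(prefix)) + ".*",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.DOTALL);
        return p.matcher(stripAccents(text)).matches();
    }

    public static void main(String[] args) {
        String s = "inform\u00e1tion"; // the accented string from the question
        System.out.println(startsWithIgnoringAccents(s, "informa") ? "Match!" : "No match");
    }
}
```

With the strings from the question this prints "Match!". Note that stripping marks changes the text being matched, so it only fits when accents really should be ignored.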
Unicode character literals: To write a Unicode character in Java source, use the escape sequence \u followed by four hexadecimal digits. Unicode escapes can be used anywhere in Java code, including inside identifiers, as long as the resulting character is legal at that position.
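To make the escape behaviour concrete, here is a small sketch (the class and field names are invented for the example). It contrasts an escape that the compiler translates with a doubled-backslash escape that is left for the regex engine to interpret, and shows an escape inside an identifier.

```java
import java.util.regex.Pattern;

class UnicodeEscapeDemo {
    // The compiler replaces the escape with U+00E1 before reading the literal.
    static final String COMPILER_ESCAPE = "\u00e1";

    // Doubled backslash: the string holds six characters, and the regex
    // engine (not the compiler) interprets them as U+00E1.
    static final String REGEX_ESCAPE = "\\u00e1";

    public static void main(String[] args) {
        int caf\u00e9 = 3; // an escape inside an identifier also compiles
        System.out.println(COMPILER_ESCAPE.length());                       // 1
        System.out.println(REGEX_ESCAPE.length());                          // 6
        System.out.println(Pattern.matches(REGEX_ESCAPE, COMPILER_ESCAPE)); // true
        System.out.println(caf\u00e9);                                      // 3
    }
}
```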
Unicode Technical Standard #18 (Unicode Regular Expressions) defines levels of Unicode support for regex engines. Level 1 is the minimally useful level of support for Unicode. All regex implementations dealing with Unicode should be at least at Level 1. Level 2 is recommended for implementations that need to handle additional Unicode features.
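Unicode is not a 16-bit encoding: its code points range from U+0000 to U+10FFFF. A Java char is a 16-bit UTF-16 code unit, so the range \u0000 to \uFFFF covers only the Basic Multilingual Plane, and characters above U+FFFF are stored as a pair of surrogate chars. UTF-8 is a variable-width encoding of the same code points: it is as compact as ASCII for ASCII text (one byte per character) and uses two to four bytes for every other character, at some cost in file size.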
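The difference between Java chars, code points, and UTF-8 bytes can be checked directly. The following is an illustrative sketch (class and helper names are made up); it uses U+1F600 as an example of a character above U+FFFF.

```java
import java.nio.charset.StandardCharsets;

class CodePointDemo {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Builds a one-code-point string, using a surrogate pair when needed.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String ascii = fromCodePoint(0x61);     // "a"
        String accented = fromCodePoint(0xE1);  // U+00E1
        String emoji = fromCodePoint(0x1F600);  // above U+FFFF

        System.out.println(emoji.length());                          // 2 chars (UTF-16 code units)
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1 code point
        System.out.println(utf8Length(ascii));    // 1
        System.out.println(utf8Length(accented)); // 2
        System.out.println(utf8Length(emoji));    // 4
    }
}
```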
U (Unicode dependent), and re.X (verbose), for the entire regular expression. (The flags are described in Module Contents.) This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.
The term "Unicode characters" is not specific enough. Every character is a Unicode character, so a pattern covering the whole Unicode range also matches "normal" characters. The term is, however, very often used when one actually means "characters which are not in the printable ASCII range".
In regex terms that would be [^\x20-\x7E].
boolean containsNonPrintableASCIIChars = string.matches(".*[^\\x20-\\x7E].*");
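A slightly more defensive variant, as a sketch: because . does not match line terminators by default, the matches() form above can return false for input containing two or more line breaks, even though a line break is itself outside the printable ASCII range. Searching with find() avoids that. The class and method names here are hypothetical.

```java
import java.util.regex.Pattern;

class NonPrintableAsciiCheck {
    // Any single character outside the printable ASCII range 0x20..0x7E.
    private static final Pattern OUTSIDE_PRINTABLE_ASCII =
            Pattern.compile("[^\\x20-\\x7E]");

    // find() scans for one offending character instead of matching the whole string.
    static boolean containsNonPrintableAscii(String s) {
        return OUTSIDE_PRINTABLE_ASCII.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsNonPrintableAscii("information"));      // false
        System.out.println(containsNonPrintableAscii("inform\u00e1tion")); // true
        System.out.println(containsNonPrintableAscii("a\n\nb"));           // true
    }
}
```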