Strange Java Unicode Regular Expression StringIndexOutOfBoundsException

My question is quite simple yet puzzling. There may be a simple switch that fixes this, but I don't have much experience with Java regexes.

String line = "💕💕💕";
line.replaceAll("(?i)(.)\\1{2,}", "$1");

This crashes. If I remove the (?i) switch, it works. The three Unicode characters are not random: they were found in the middle of a large Korean text, but I don't know whether they are valid.

The strange thing is that the regex works for all of the other text, just not for this. Why do I get the error?

This is the exception I get:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 6
    at java.lang.String.charAt(String.java:658)
    at java.lang.Character.codePointAt(Character.java:4668)
    at java.util.regex.Pattern$CIBackRef.match(Pattern.java:4846)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4125)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3694)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Start.match(Pattern.java:3408)
    at java.util.regex.Matcher.search(Matcher.java:1199)
    at java.util.regex.Matcher.find(Matcher.java:592)
    at java.util.regex.Matcher.replaceAll(Matcher.java:902)
    at java.lang.String.replaceAll(String.java:2162)
    at tokenizer.Test.main(Test.java:51)
binit asked Apr 15 '13 06:04



2 Answers

The characters you mentioned are actually "double-byte characters", which means that two bytes form one character. For Java to interpret them, the encoding information (when it differs from the default platform encoding) needs to be passed explicitly; otherwise the default platform encoding is used.

To prove this, consider the following:

String line = "💕💕💕";
System.out.println(line.length());

This prints the length as 6, whereas we only have three characters.

Now the following code

String line1 = new String("💕💕💕".getBytes(),"UTF-8");
System.out.println(line1.length());

prints the length as 3, which is what was intended.

If you replace the line

String line = "💕💕💕";

with

 String line1 = new String("💕💕💕".getBytes(),"UTF-8");

it works and the regex does not fail. I have used UTF-8 here. Please use the appropriate encoding for your intended platform.

The Java regex libraries depend heavily on CharSequence, which in turn depends on the encoding scheme. For strings whose character encoding differs from the default encoding, the characters cannot be decoded correctly (it showed 6 chars instead of 3!) and hence the regex fails.

Santosh answered Oct 01 '22 07:10


What's explained by Santosh in this answer is incorrect. This can be demonstrated by running

String str = "💕💕💕";
System.out.println("code point: " + str.codePointAt(0));

which will output (at least for me) the value 128149, which is confirmed by this page as correct (it is U+1F495). So Java does not misinterpret the string. The misinterpretation happens in the getBytes() call, which, when given no argument, encodes with the platform's default charset.
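The following is a minimal sketch of that check, not part of the original answer. The class name `SurrogatePairDemo` and the helper `roundTrip` are made up for illustration, and `US_ASCII` is only an assumed stand-in for a default charset that cannot represent the character; the actual default on the machines involved is not known. The emoji is written as its surrogate pair so that the source file's encoding cannot influence the result.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SurrogatePairDemo {
    // U+1F495 three times, written as UTF-16 surrogate pairs.
    static final String HEARTS = "\uD83D\uDC95\uD83D\uDC95\uD83D\uDC95";

    // Hypothetical helper: encode with the given charset, then decode the bytes as UTF-8,
    // mimicking new String(s.getBytes(), "UTF-8") with an explicit stand-in for the default charset.
    static String roundTrip(String s, Charset encodeWith) {
        return new String(s.getBytes(encodeWith), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(HEARTS.length());                           // 6 UTF-16 code units
        System.out.println(HEARTS.codePointCount(0, HEARTS.length())); // 3 code points
        System.out.println(HEARTS.codePointAt(0));                     // 128149 (0x1F495)

        // Encoding and decoding with the same charset leaves the string unchanged.
        String viaUtf8 = roundTrip(HEARTS, StandardCharsets.UTF_8);
        System.out.println(viaUtf8.equals(HEARTS));                    // true
        System.out.println(viaUtf8.length());                          // 6

        // A charset that cannot represent the character replaces each one with '?'.
        String viaAscii = roundTrip(HEARTS, StandardCharsets.US_ASCII);
        System.out.println(viaAscii.length());                         // 3
    }
}
```

Under that assumption, a length of 3 after the round trip means the original characters were replaced, not that they were decoded "correctly": `length()` counts UTF-16 code units, and 6 is the expected value for three characters outside the Basic Multilingual Plane.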

However, as the OP explained, the regular expression does crash on that input. I have no explanation for it other than a bug in Java. Either that, or it doesn't fully support UTF-16 by design.

Edit:

Based on this answer:

the regex compiler screws up on the UTF-16. Again, this can never be fixed or it will change old programs. You cannot even get around the bug by using the normal workaround to Java’s Unicode-in-source-code troubles by compiling with java -encoding UTF-8, because the stupid thing stores the strings as nasty UTF-16, which necessarily breaks them in character classes. OOPS!

It would seem that this is a limitation of regular expressions in Java.


Since you commented that

it would be best if I could simply ignore the UTF-16 characters and apply the regex rather than throw an exception.

This can certainly be done. A straightforward way is to apply your regex only to a certain range of characters. Filtering Unicode character ranges has been explained in this answer. Based on that answer, here is an example that doesn't seem to choke and simply leaves the problem characters alone:

line.replaceAll("(?Ui)([\\u0000-\\uffff])\\1{2,}", "$1")    

// "💕💕💕" -> "💕💕💕"
// "foo 💕💕💕 foo" -> "foo 💕💕💕 foo"
// "foo aAa foo" -> "foo a foo"
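If the goal is to collapse the problem characters as well instead of skipping them, one alternative is to avoid the regex engine and walk the string by code point. This is a different technique from the regex above, sketched under assumptions: `RunCollapser`, `collapseRuns` and `fold` are hypothetical names, case-insensitivity is approximated with `Character.toUpperCase` followed by `Character.toLowerCase`, and unlike `.` in the original pattern it also collapses runs of line terminators.

```java
class RunCollapser {
    // Simple per-code-point case folding; an approximation, not full Unicode case folding.
    static int fold(int codePoint) {
        return Character.toLowerCase(Character.toUpperCase(codePoint));
    }

    // Replaces every run of 3 or more equal code points (ignoring case) with its first code point.
    // Shorter runs are copied unchanged, like (.)\1{2,} -> $1.
    static String collapseRuns(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            int first = s.codePointAt(i);
            int folded = fold(first);
            int j = i + Character.charCount(first); // step by 1 or 2 chars, never into a surrogate pair
            int runLength = 1;
            while (j < s.length()) {
                int next = s.codePointAt(j);
                if (fold(next) != folded) {
                    break;
                }
                j += Character.charCount(next);
                runLength++;
            }
            if (runLength >= 3) {
                out.appendCodePoint(first);
            } else {
                out.append(s, i, j);
            }
            i = j;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String heart = "\uD83D\uDC95"; // U+1F495 as a surrogate pair
        System.out.println(collapseRuns("foo aAa foo"));                         // foo a foo
        System.out.println(collapseRuns("foo " + heart + heart + heart + " foo"));
    }
}
```

Because the loop always advances by `Character.charCount`, it never indexes past the end of the string or splits a surrogate pair, which is the kind of failure shown in the stack trace.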
eis answered Oct 01 '22 07:10