Strange Java Unicode Regular Expression StringIndexOutOfBoundsException

My question is quite simple yet puzzling. There may be a simple switch that fixes this, but I'm not very experienced with Java regexes...

String line = "💕💕💕";
line.replaceAll("(?i)(.)\\1{2,}", "$1");

This crashes. If I remove the (?i) switch, it works. The three Unicode characters are not random: they were found in the middle of a large Korean text, but I don't know whether they are valid or not.

Strange thing is that the regex works for all the other text but this. Why do I get the error?

This is the exception I get:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 6
    at java.lang.String.charAt(String.java:658)
    at java.lang.Character.codePointAt(Character.java:4668)
    at java.util.regex.Pattern$CIBackRef.match(Pattern.java:4846)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4125)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3694)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Start.match(Pattern.java:3408)
    at java.util.regex.Matcher.search(Matcher.java:1199)
    at java.util.regex.Matcher.find(Matcher.java:592)
    at java.util.regex.Matcher.replaceAll(Matcher.java:902)
    at java.lang.String.replaceAll(String.java:2162)
    at tokenizer.Test.main(Test.java:51)

asked Apr 15 '13 06:04

binit

2 Answers

The characters you mentioned are actually "double-byte characters", which means that two bytes form one character. For Java to interpret them, the encoding information (when it differs from the default platform encoding) needs to be passed explicitly; otherwise the default platform encoding is used.

To prove this, consider the following:

String line = "💕💕💕";
System.out.println(line.length());

This prints the length as 6, whereas we only have three characters.

Now the following code

String line1 = new String("💕💕💕".getBytes(),"UTF-8");
System.out.println(line1.length());

prints the length as 3, which is what was intended.

If you replace the line

String line = "💕💕💕";

with

String line1 = new String("💕💕💕".getBytes(),"UTF-8");

it works and regex does not fail. I have used UTF-8 here. Please use the appropriate encoding of your intended platform.

Java's regex libraries depend heavily on CharSequence, which in turn depends on the encoding scheme. For strings whose character encoding differs from the default encoding, the characters cannot be decoded correctly (it showed 6 chars instead of 3!), and hence the regex fails.

answered Oct 01 '22 07:10

Santosh

What's explained by Santosh in this answer is incorrect. This can be demonstrated by running

String str = "💕💕💕";
System.out.println("code point: " + str.codePointAt(0));

which will output (at least for me) the value 128149, which this page confirms is correct. So Java does not interpret the string literal in a wrong way. The wrong interpretation happens when the string goes through the getBytes() method.
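As a minimal sketch of the point above, the following compares UTF-16 code units with code points and shows what a byte round trip does when the charset is passed explicitly. The class name CodePointDemo and the helper roundTrip are made up for illustration, and the string is written with escapes so that the source file encoding does not matter:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CodePointDemo {
    // Three copies of U+1F495, each stored as a surrogate pair (two chars).
    static final String HEARTS = "\uD83D\uDC95\uD83D\uDC95\uD83D\uDC95";

    // Hypothetical helper: encode to bytes and decode again with the same charset.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // length() counts UTF-16 code units, not characters: 6
        System.out.println(HEARTS.length());
        // codePointCount() counts Unicode code points: 3
        System.out.println(HEARTS.codePointCount(0, HEARTS.length()));
        // The first code point is 128149, i.e. U+1F495
        System.out.println(HEARTS.codePointAt(0));
        // A UTF-8 round trip returns the identical string (still 6 chars): true
        System.out.println(roundTrip(HEARTS, StandardCharsets.UTF_8).equals(HEARTS));
        // A charset that cannot encode the character replaces each heart with '?'
        System.out.println(roundTrip(HEARTS, StandardCharsets.US_ASCII));
    }
}
```

If the round trip goes through a charset that cannot represent the character, such as US-ASCII here, each heart comes back as a question mark. That would explain a length of 3 after getBytes() with a default platform encoding that lacks these characters.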

However, as explained by the OP, it seems the regular expression crashes on that. I have no explanation for it other than it being a bug in Java. Either that, or it doesn't fully support UTF-16 by design.
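To check how a particular JDK behaves, a small probe along these lines can be run. It is only a sketch: the class name RegexProbe and the method tryCollapse are hypothetical, and it makes no assumption about whether the running JDK throws, it only reports what happened.

```java
final class RegexProbe {
    // Applies the regex from the question and reports an exception instead of crashing.
    static String tryCollapse(String line) {
        try {
            return line.replaceAll("(?i)(.)\\1{2,}", "$1");
        } catch (StringIndexOutOfBoundsException e) {
            return "threw: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Three copies of U+1F495, written with escapes to avoid source encoding issues.
        String hearts = "\uD83D\uDC95\uD83D\uDC95\uD83D\uDC95";
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("hearts  -> " + tryCollapse(hearts));
        System.out.println("foo aaa -> " + tryCollapse("foo aaa foo"));
    }
}
```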

Edit:

based on this answer:

the regex compiler screws up on the UTF-16. Again, this can never be fixed or it will change old programs. You cannot even get around the bug by using the normal workaround to Java’s Unicode-in-source-code troubles by compiling with java -encoding UTF-8, because the stupid thing stores the strings as nasty UTF-16, which necessarily breaks them in character classes. OOPS!

It would seem that this is a limitation of regular expressions in Java.

Since you commented that

it would be best if I could simply ignore the UTF-16 characters and apply the regex rather than throw an exception.

This can certainly be done. A straightforward way is to apply your regex only to a certain range of characters. Filtering Unicode character ranges has been explained in this answer. Based on that answer, here is an example that doesn't seem to choke but just leaves the problem characters alone:

line.replaceAll("(?Ui)([\\u0000-\\uffff])\\1{2,}", "$1")

// "💕💕💕" -> "💕💕💕"
// "foo 💕💕💕 foo" -> "foo 💕💕💕 foo"
// "foo aAa foo" -> "foo a foo"
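For convenience, the same workaround can be wrapped in a small runnable class. This is a sketch only: the names BmpOnlyCollapse and collapse are invented here, and the pattern is the one shown above, precompiled once.

```java
import java.util.regex.Pattern;

final class BmpOnlyCollapse {
    // Same regex as above: only characters in the range U+0000..U+FFFF may start a run.
    private static final Pattern REPEATS =
            Pattern.compile("(?Ui)([\\u0000-\\uffff])\\1{2,}");

    // Collapses runs of three or more identical characters (ignoring case) to one.
    static String collapse(String line) {
        return REPEATS.matcher(line).replaceAll("$1");
    }

    public static void main(String[] args) {
        // Three copies of U+1F495, written with escapes to avoid source encoding issues.
        String hearts = "\uD83D\uDC95\uD83D\uDC95\uD83D\uDC95";
        System.out.println(collapse(hearts));                   // left unchanged
        System.out.println(collapse("foo " + hearts + " foo")); // left unchanged
        System.out.println(collapse("foo aAa foo"));            // foo a foo
    }
}
```

Note that this does not collapse repeated characters outside the basic range. It only keeps them from taking part in the match.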

answered Oct 01 '22 07:10

eis

Tags: java, regex, unicode