My question is quite simple yet puzzling. There may be a simple switch that fixes this, but I don't have much experience with Java regexes.
String line = "💕💕💕";
line.replaceAll("(?i)(.)\\1{2,}", "$1");
This crashes. If I remove the (?i) switch, it works. The three Unicode characters are not random; they were found in a large Korean text, but I don't know whether they are valid.
The strange thing is that the regex works for all the other text, just not for this. Why do I get the error?
This is the exception I get:
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 6
at java.lang.String.charAt(String.java:658)
at java.lang.Character.codePointAt(Character.java:4668)
at java.util.regex.Pattern$CIBackRef.match(Pattern.java:4846)
at java.util.regex.Pattern$Curly.match(Pattern.java:4125)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3694)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
at java.util.regex.Pattern$Start.match(Pattern.java:3408)
at java.util.regex.Matcher.search(Matcher.java:1199)
at java.util.regex.Matcher.find(Matcher.java:592)
at java.util.regex.Matcher.replaceAll(Matcher.java:902)
at java.lang.String.replaceAll(String.java:2162)
at tokenizer.Test.main(Test.java:51)
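For anyone who wants to try this on their own JVM, here is a minimal, self-contained sketch of the snippet above. The class name `BackrefRepro` and the helper `tryReplace` are made up for illustration, and the emoji is written as a surrogate-pair escape so the result does not depend on how the source file is encoded. Whether the `(?i)` variant throws may depend on the Java version, so the sketch catches the exception and reports it instead of assuming either outcome.

```java
class BackrefRepro {
    // U+1F495 written as a UTF-16 surrogate pair (hypothetical constant name).
    static final String HEART = "\uD83D\uDC95";

    /** Runs replaceAll with the given regex; returns the result or a note about the exception. */
    static String tryReplace(String regex, String input) {
        try {
            return input.replaceAll(regex, "$1");
        } catch (IndexOutOfBoundsException e) {
            // StringIndexOutOfBoundsException is a subclass of IndexOutOfBoundsException.
            return "threw " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        String line = HEART + HEART + HEART;
        // Case-sensitive backreference: the variant reported as working.
        System.out.println(tryReplace("(.)\\1{2,}", line));
        // Case-insensitive backreference: the variant reported as crashing.
        System.out.println(tryReplace("(?i)(.)\\1{2,}", line));
    }
}
```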
The characters you mentioned are actually "double-byte characters", which means that two bytes form one character. For Java to interpret them, the encoding information (when it differs from the default platform encoding) needs to be passed explicitly; otherwise the default platform encoding is used.
To prove this, consider the following:
String line = "💕💕💕";
System.out.println(line.length());
This prints the length as 6, whereas we only have three characters.
Now the following code
String line1 = new String("💕💕💕".getBytes(),"UTF-8");
System.out.println(line1.length());
prints the length as 3, which is what was intended.
If you replace the line
String line = "💕💕💕";
with
String line1 = new String("💕💕💕".getBytes(),"UTF-8");
it works and the regex does not fail. I have used UTF-8 here; please use the appropriate encoding for your intended platform.
Java's regex library depends heavily on CharSequence, which in turn depends on the encoding scheme. For strings whose character encoding differs from the default encoding, the characters cannot be decoded correctly (it showed 6 chars instead of 3!), and hence the regex fails.
What's explained by Santosh in this answer is incorrect. This can be demonstrated by running
String str = "💕💕💕";
System.out.println("code point: " + str.codePointAt(0));
which will output (at least for me) the value 128149, which is confirmed by this page as correct. So Java does not interpret the string in a wrong way. It did interpret it wrong when using the getBytes() method.
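To make the point a bit more concrete, here is a small sketch (the class name `SurrogateDemo` is just for illustration). It shows that `length()` counts UTF-16 code units while `codePointCount` counts Unicode code points, so a length of 6 for three emoji is expected and does not mean the string was decoded wrongly. The emoji is written as escapes to keep the source-file encoding out of the picture.

```java
class SurrogateDemo {
    // Three copies of U+1F495, each stored as a high/low surrogate pair.
    static final String HEARTS = "\uD83D\uDC95\uD83D\uDC95\uD83D\uDC95";

    public static void main(String[] args) {
        // Number of UTF-16 code units (Java chars).
        System.out.println(HEARTS.length());                           // 6
        // Number of Unicode code points.
        System.out.println(HEARTS.codePointCount(0, HEARTS.length())); // 3
        // The first code point, decoded from the first surrogate pair.
        System.out.println(HEARTS.codePointAt(0));                     // 128149
        // The two chars that make up that code point.
        System.out.println(Character.isHighSurrogate(HEARTS.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(HEARTS.charAt(1)));  // true
    }
}
```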
However, as explained by the OP, the regular expression does seem to crash on that input. I have no other explanation for it than it being a bug in Java. Either that, or it doesn't fully support UTF-16 by design.
Edit:
based on this answer:
the regex compiler screws up on the UTF-16. Again, this can never be fixed or it will change old programs. You cannot even get around the bug by using the normal workaround to Java’s Unicode-in-source-code troubles by compiling with java -encoding UTF-8, because the stupid thing stores the strings as nasty UTF-16, which necessarily breaks them in character classes. OOPS!
It would seem that this is a limitation of regular expressions in Java.
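If the regex engine cannot be trusted here, one way around it is to skip the regex entirely and walk the string by code point. The following is only a sketch of that idea, not something taken from the linked answer: `CodePointCollapse` and `collapseRuns` are hypothetical names, and it is not an exact equivalent of the original pattern (it folds case with `Character.toLowerCase`, which is Unicode-aware, and it also collapses repeated line terminators, which `.` would not match).

```java
class CodePointCollapse {
    /** Collapses runs of 3 or more equal code points (ignoring case) to the first one. */
    static String collapseRuns(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            int j = i + Character.charCount(cp); // end of the current run (exclusive)
            int run = 1;
            while (j < s.length()) {
                int next = s.codePointAt(j);
                if (Character.toLowerCase(next) != Character.toLowerCase(cp)) {
                    break;
                }
                run++;
                j += Character.charCount(next);
            }
            if (run >= 3) {
                out.appendCodePoint(cp); // keep only the first code point of the run
            } else {
                out.append(s, i, j);     // short runs are copied unchanged
            }
            i = j;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapseRuns("foo aAa foo"));
        System.out.println(collapseRuns("\uD83D\uDC95\uD83D\uDC95\uD83D\uDC95"));
    }
}
```

Unlike the range-based regex, this version also collapses the emoji, because it never splits a surrogate pair.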
Since you commented that
it would be best if I could simply ignore the UTF-16 characters and apply the regex rather than throw an exception.
This can certainly be done. A straightforward way is to apply your regex only to a certain range of characters. Filtering Unicode character ranges has been explained in this answer. Based on that answer, here is an example that doesn't seem to choke but just leaves the problem characters alone:
line.replaceAll("(?Ui)([\\u0000-\\uffff])\\1{2,}", "$1")
// "💕💕💕" -> "💕💕💕"
// "foo 💕💕💕 foo" -> "foo 💕💕💕 foo"
// "foo aAa foo" -> "foo a foo"
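Here is the same one-liner wrapped in a runnable class, in case you want to check the three cases yourself. The class and method names (`BmpOnlyCollapse`, `collapse`) are placeholders of mine. The idea, as far as I can tell, is that the captured group is limited to a single `char` in the range U+0000 to U+FFFF, so the case-insensitive backreference never has to step over a surrogate pair.

```java
class BmpOnlyCollapse {
    // Group 1 only matches code points up to U+FFFF; (?U) and (?i) are kept from the example above.
    static final String REGEX = "(?Ui)([\\u0000-\\uffff])\\1{2,}";

    /** Collapses runs of 3+ equal characters (ignoring case) that fit in a single char. */
    static String collapse(String line) {
        return line.replaceAll(REGEX, "$1");
    }

    public static void main(String[] args) {
        // U+1F495 three times, written as surrogate pairs.
        String hearts = "\uD83D\uDC95\uD83D\uDC95\uD83D\uDC95";
        System.out.println(collapse(hearts));
        System.out.println(collapse("foo " + hearts + " foo"));
        System.out.println(collapse("foo aAa foo"));
    }
}
```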