unicode regex pattern not working

Tags: java, regex, unicode

I am trying to match a sequence of Unicode characters:

Pattern pattern = Pattern.compile("\\u05[dDeE][0-9a-fA-F]{2,}");
String text = "\\n     \\u05db\\u05d3\\u05d5\\u05e8\\u05d2\\u05dc\\n    <\\/span>\\n<br style=\\";
Matcher match = pattern.matcher(text);

but doing so gives this exception:

Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal Unicode escape sequence near index 4
  \u05[dDeE][0-9a-fA-F]+
      ^

How can I still use regex metacharacters (like "[") in a pattern that matches Unicode?

EDIT: I'm trying to parse some text. Somewhere in it there is a sequence of Unicode characters whose code point range I know.

Edit2: I am now using a range instead, [\\u05d0-\\u05ea]{2,}, but it still doesn't match the text above.

Edit3: OK, now it's working. The problem was that I used two backslashes instead of one, both in the regex and in the text. Assuming I know there will be two or more characters, the solution is:

[\u05d0-\u05ea]{2,}
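For anyone landing here, a minimal sketch of why the range only works once the text holds real characters. The class and method names (HebrewRangeMatch, firstMatch) are made up for illustration, and the sample strings are shortened versions of the text in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HebrewRangeMatch {
    // Hypothetical helper class, not from the original post.
    // The regex engine resolves both range endpoints itself.
    static final Pattern HEBREW_RUN = Pattern.compile("[\\u05d0-\\u05ea]{2,}");

    // Returns the first run of two or more characters in the range, or null.
    public static String firstMatch(String text) {
        Matcher m = HEBREW_RUN.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Decoded text: the string holds six real Hebrew characters.
        String decoded = "\n \u05db\u05d3\u05d5\u05e8\u05d2\u05dc\n";
        // Escaped text: the string holds a backslash, the letter u and hex digits.
        String escaped = "\\n \\u05db\\u05d3\\u05d5\\u05e8\\u05d2\\u05dc\\n";

        System.out.println(firstMatch(decoded) != null); // true
        System.out.println(firstMatch(escaped) != null); // false
    }
}
```

The range compares characters, so it has nothing to match in the escaped form, where every character is plain ASCII.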

asked Sep 04 '13 13:09

limido

1 Answer

Here is what is causing the exception:

\\u05[dDeE][0-9a-fA-F]{2,}
  ^^^^

The Java regular expression parser sees \u and expects the escape sequence \uNNNN for a Unicode code point. \u must be followed by exactly four hexadecimal digits, but here only two (05) come before the [, so the parser throws an exception. If a single code point is really what you want, supply all four digits, for example \\u0005 for U+0005.
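A small sketch of that behaviour, assuming a hypothetical helper named compiles that just reports whether Pattern.compile accepts a regex:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class UnicodeEscapeCheck {
    // Hypothetical helper: true if the regex compiles, false if the parser rejects it.
    public static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Only two hex digits follow the backslash-u escape before the bracket.
        System.out.println(compiles("\\u05[dDeE][0-9a-fA-F]{2,}")); // false
        // Four hex digits follow the backslash-u escape, so it is complete.
        System.out.println(compiles("\\u05d0")); // true
    }
}
```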

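On the other hand, if you want to match the literal text \\u (two backslashes followed by u) in the target string, then each literal backslash has to be written as four backslashes (\\\\) in the Java string: the Java compiler halves them once and the regex engine halves them again. So to match \\u you need \\\\\\\\u:

\\\\\\\\u05[dDeE][0-9a-fA-F]{2,}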
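The two layers of escaping are easy to miscount, so here is a rough sketch that checks them. The class name and the matchesWhole helper are made up for illustration, and the sample strings are single escapes, not the full text from the question.

```java
import java.util.regex.Pattern;

public class BackslashLayers {
    // Hypothetical helper: true if the regex matches the entire text.
    public static boolean matchesWhole(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // At runtime: two backslashes, then u05db.
        String twoBackslashes = "\\\\u05db";
        // At runtime: one backslash, then u05db.
        String oneBackslash = "\\u05db";

        // Eight backslashes in source become a regex for two literal backslashes.
        String eight = "\\\\\\\\u05[dDeE][0-9a-fA-F]";
        // Four backslashes in source become a regex for one literal backslash.
        String four = "\\\\u05[dDeE][0-9a-fA-F]";

        System.out.println(matchesWhole(eight, twoBackslashes)); // true
        System.out.println(matchesWhole(eight, oneBackslash));   // false
        System.out.println(matchesWhole(four, oneBackslash));    // true
    }
}
```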

Finally, note that {2,} in that expression applies only to the last character class. To match two or more of those escaped code points literally in your target string, wrap the whole escape in a non-capturing group and repeat the group:

(?:\\\\\\\\u05[dDeE][0-9a-fA-F]){2,}
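To see what the group changes, here is a sketch comparing the expression with and without it. The class and method names are hypothetical, and the sample text is two escapes with doubled backslashes, made up to keep the example short.

```java
import java.util.regex.Pattern;

public class GroupedRepeat {
    // Hypothetical helper: true if the regex matches anywhere in the text.
    public static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // At runtime: two escapes in a row, each starting with two backslashes.
        String twoEscapes = "\\\\u05db\\\\u05d3";

        // Without a group, the quantifier repeats only the last hex digit class.
        String ungrouped = "\\\\\\\\u05[dDeE][0-9a-fA-F]{2,}";
        // With a group, the quantifier repeats the whole escape.
        String grouped = "(?:\\\\\\\\u05[dDeE][0-9a-fA-F]){2,}";

        System.out.println(found(ungrouped, twoEscapes)); // false
        System.out.println(found(grouped, twoEscapes));   // true
    }
}
```

The ungrouped version wants one escape followed by at least two extra hex digits, which the sample text never contains.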

Edit: Since there is only one backslash before each u in your target string, your regular expression should be:

(?:\\\\u05[dDeE][0-9a-fA-F]){2,}

This will match \u05db\u05d3\u05d5\u05e8\u05d2\u05dc in your string:

<\/span><\/span><span dir=\"rtl\">\n \u05db\u05d3\u05d5\u05e8\u05d2\u05dc\n <\/span>\n<br style=\"clear : both; font-size : 1px;\">\n<\/div>"}, 200, null, null);
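A minimal sketch of the one-backslash expression in use, run against a shortened version of that text. LiteralEscapeMatch and firstMatch are names invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LiteralEscapeMatch {
    // Four backslashes in source give a regex that matches one literal backslash.
    static final Pattern ESCAPED_HEBREW =
            Pattern.compile("(?:\\\\u05[dDeE][0-9a-fA-F]){2,}");

    // Hypothetical helper: returns the first run of escapes, or null if none.
    public static String firstMatch(String text) {
        Matcher m = ESCAPED_HEBREW.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Each escape below is a single backslash plus u and hex digits at runtime.
        String text = "\\n \\u05db\\u05d3\\u05d5\\u05e8\\u05d2\\u05dc\\n <\\/span>";
        // Prints the six escapes as literal text, not as Hebrew letters.
        System.out.println(firstMatch(text));
    }
}
```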

Edit 2: If you want to match the literal text \u05db\u05d3\u05d5\u05e8\u05d2\u05dc (a backslash, a u and four hex digits, repeated), then you can't use a range.

On the other hand, if you want to match actual Unicode code points between 05d0 and 05df, then you can use a range, with a hyphen between the two endpoints:

(?:[\\u05d0-\\u05df]){2,}
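One thing worth checking in any such class is the hyphen: without it the class lists two individual characters and is not a range. A quick sketch, with a hypothetical class name and a two-letter sample string chosen to sit strictly inside 05d0 to 05df:

```java
import java.util.regex.Pattern;

public class ClassVsRange {
    // Hypothetical helper: true if the regex matches the entire text.
    public static boolean matchesWhole(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // Two real characters, U+05D3 and U+05D5, both inside the range.
        String word = "\u05d3\u05d5";

        // No hyphen: the class contains only the two endpoint characters.
        System.out.println(matchesWhole("(?:[\\u05d0\\u05df]){2,}", word));  // false
        // Hyphen: the class contains every code point from the first to the last.
        System.out.println(matchesWhole("(?:[\\u05d0-\\u05df]){2,}", word)); // true
    }
}
```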

answered Sep 20 '22 17:09

Ibrahim Najjar
