I am trying to match a sequence of Unicode characters:
Pattern pattern = Pattern.compile("\\u05[dDeE][0-9a-fA-F]{2,}");
String text = "\\n \\u05db\\u05d3\\u05d5\\u05e8\\u05d2\\u05dc\\n <\\/span>\\n<br style=\\";
Matcher match = pattern.matcher(text);
but doing so gives this exception:
Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal Unicode escape sequence near index 4
\u05[dDeE][0-9a-fA-F]+
^
How can I still use a regex with regex metacharacters (like "[") to match Unicode?
EDIT: I'm trying to parse some text. Somewhere in the text there is a sequence of Unicode characters whose code range I know.
Edit 2: I am now using a range instead:
[\\u05d0-\\u05ea]{2,}
but it still doesn't match the text above.
Edit 3: OK, now it's working. The problem was that I used two backslashes instead of one, both in the regex and in the text. Assuming I know there will be two or more characters, the solution is:
[\u05d0-\u05ea]{2,}
RegexBuddy's regex engine is fully Unicode-based starting with version 2.0.0.
The backslash \ is an escape character in Java strings, so it has a predefined meaning in a string literal. You have to write a double backslash \\ to get a single backslash. If you want \w in your regex, you must write \\w in the Java source.
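A minimal sketch of the escaping rule, assuming a hypothetical helper named isWord (it is not part of the original post). It only illustrates that the Java literal "\\w+" reaches the regex engine as the three characters \w+.

```java
import java.util.regex.Pattern;

public class BackslashEscapeDemo {
    // Hypothetical helper: true if the whole input is one or more word characters.
    public static boolean isWord(String input) {
        // The Java literal "\\w+" is the three-character regex: backslash, w, plus.
        return Pattern.matches("\\w+", input);
    }

    public static void main(String[] args) {
        // The doubled backslash in source is a single character at run time.
        System.out.println("\\w+".length());
        System.out.println(isWord("abc_123"));
        System.out.println(isWord("abc 123"));
    }
}
```

The same doubling applies to every regex escape written inside a Java string literal.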
\p{L} matches a single code point in the category "letter". \p{N} matches any kind of numeric character in any script. Source: regular-expressions.info.
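As a rough illustration of these categories in Java, the sketch below checks whole strings against \p{L} and \p{N}. The class and method names are made up for this example.

```java
import java.util.regex.Pattern;

public class UnicodeCategoryDemo {
    // True if every character of s is in the Unicode "letter" category.
    public static boolean allLetters(String s) {
        return Pattern.matches("\\p{L}+", s);
    }

    // True if every character of s is in the Unicode "number" category.
    public static boolean allNumeric(String s) {
        return Pattern.matches("\\p{N}+", s);
    }

    public static void main(String[] args) {
        // The compiler turns these source-level escapes into real Hebrew letters.
        System.out.println(allLetters("\u05db\u05d3\u05d5\u05e8"));
        System.out.println(allLetters("abc1"));
        System.out.println(allNumeric("42"));
    }
}
```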
\u000d — Carriage return — \r. \u2028 — Line separator. \u2029 — Paragraph separator.
Here is what is causing the exception:
\\u05[dDeE][0-9a-fA-F]{2,}
^^^^
The Java regular expression parser thinks you are trying to match a Unicode code point using the escape sequence \uNNNN, so it throws an exception: \u requires four hexadecimal digits after it, and there are only two of them, namely 05. So you need to change it to something like \\u0005, if that is what you actually want.
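The failure can be reproduced in isolation. This is a small sketch with a hypothetical compiles helper that catches the PatternSyntaxException instead of letting it terminate the program.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class UnicodeEscapeCheck {
    // Hypothetical helper: reports whether the regex engine accepts the pattern.
    public static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The engine sees backslash-u followed by only two hex digits, so it rejects the pattern.
        System.out.println(compiles("\\u05[dDeE][0-9a-fA-F]{2,}"));
        // Four hex digits after backslash-u form a complete escape.
        System.out.println(compiles("\\u05d0"));
    }
}
```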
On the other hand, if you want to match \\u in the target string, then you need to quadruple each backslash: every literal \ becomes \\\\ in the Java string. So to match \\u you need \\\\\\\\u:
\\\\\\\\u05[dDeE][0-9a-fA-F]{2,}
Finally, if you want to match those Unicode code points literally in your target string then you need to modify our last expression a bit like this:
(?:\\\\\\\\u05[dDeE][0-9a-fA-F]){2,}
Edit: Since there is only one backslash in your target string then your regular expression should be:
(?:\\\\u05[dDeE][0-9a-fA-F]){2,}
This will match \u05db\u05d3\u05d5\u05e8\u05d2\u05dc in your string:
<\/span><\/span><span dir=\"rtl\">\n \u05db\u05d3\u05d5\u05e8\u05d2\u05dc\n <\/span>\n<br style=\"clear : both; font-size : 1px;\">\n<\/div>"}, 200, null, null);
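A short sketch of the single-backslash case, run against a shortened version of the sample text. The class name and the firstRun helper are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapedTextMatch {
    // An escaped backslash, then u05, a d/e nibble and a hex digit; repeated two or more times.
    static final Pattern ESCAPES = Pattern.compile("(?:\\\\u05[dDeE][0-9a-fA-F]){2,}");

    // Returns the first run of escape sequences found in text, or null if there is none.
    public static String firstRun(String text) {
        Matcher m = ESCAPES.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Each doubled backslash in this literal is one backslash character at run time.
        String text = "\\n \\u05db\\u05d3\\u05d5\\u05e8\\u05d2\\u05dc\\n <\\/span>";
        System.out.println(firstRun(text));
    }
}
```

The match is the six escape sequences as plain text (36 characters), not six Hebrew characters; decoding them would be a separate step.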
Edit 2: If you want to match the literal text \u05db\u05d3\u05d5\u05e8\u05d2\u05dc, then you can't use a range.
On the other hand, if you want to match Unicode code points between 05d0 and 05df, then you can use:
(?:[\\u05d0-\\u05df]){2,}
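For comparison, here is a sketch of range matching against real Hebrew characters rather than escape text. It assumes a hypothetical firstMatch helper, and it also tries the wider range 05d0 to 05ea from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HebrewRangeMatch {
    // Returns the first substring of text matched by regex, or null if nothing matches.
    public static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Here the compiler turns each source-level escape into a real Hebrew character.
        String text = "abc \u05db\u05d3\u05d5\u05e8\u05d2\u05dc xyz";
        // The hyphen makes a range; without it the class would list just two characters.
        String narrow = firstMatch("[\\u05d0-\\u05df]{2,}", text);
        String wide = firstMatch("[\\u05d0-\\u05ea]{2,}", text);
        System.out.println(narrow.length()); // 3: the fourth letter, 05e8, is outside 05d0-05df
        System.out.println(wide.length());   // 6: the whole word fits in 05d0-05ea
    }
}
```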