
unicode regex pattern not working

I am trying to match a sequence of Unicode characters:

Pattern pattern = Pattern.compile("\\u05[dDeE][0-9a-fA-F]{2,}");
String text = "\\n     \\u05db\\u05d3\\u05d5\\u05e8\\u05d2\\u05dc\\n    <\\/span>\\n<br style=\\";
Matcher match = pattern.matcher(text);

but doing so gives this exception:

Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal Unicode escape sequence near index 4
  \u05[dDeE][0-9a-fA-F]+
      ^

How can I still use a regex with some regex metacharacters (like "[") to match Unicode?

EDIT: I'm trying to parse some text. Somewhere in the text there is a sequence of Unicode characters whose code range I know.

Edit2: I am now using a range instead: [\\u05d0-\\u05ea]{2,} but I still can't match the text above.

Edit3: OK, it's working now. The problem was that I used two backslashes instead of one, both in the regex and in the text. The solution, assuming I know there will be two or more characters:

[\u05d0-\u05ea]{2,}
asked Sep 04 '13 13:09 by limido



1 Answer

Here is what is causing the exception:

\\u05[dDeE][0-9a-fA-F]{2,}
^^^^^

The Java regular expression parser thinks you are trying to match a Unicode code point with the escape sequence \uNNNN. It throws the exception because \u must be followed by exactly four hexadecimal digits, and here there are only two, namely 05. If a code point is what you actually want, you need to write all four digits, for example \\u0005.
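As a rough illustration of the point above, the following sketch compiles the pattern from the question and a complete four-digit escape side by side. The class name IncompleteUnicodeEscapeDemo and the helper compiles are made up for this example; only Pattern.compile and PatternSyntaxException come from java.util.regex.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class, not part of the original question or answer.
class IncompleteUnicodeEscapeDemo {

    // Returns true if the regex compiles, false if the regex parser rejects it.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            System.out.println("Rejected: " + e.getDescription());
            return false;
        }
    }

    public static void main(String[] args) {
        // The regex engine sees a backslash-u escape followed by only two hex digits.
        System.out.println(compiles("\\u05[dDeE][0-9a-fA-F]{2,}"));
        // Here the backslash-u escape is followed by all four hex digits.
        System.out.println(compiles("\\u05d0"));
    }
}
```

The first call should be rejected and the second accepted, which is the same distinction the exception in the question is reporting.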

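On the other hand, if you want to match a literal \\u in the target string, each backslash you want to match has to be written as four backslashes in the Java string (\\\\), because the Java compiler and the regex engine each consume one level of escaping. So to match \\u you need \\\\\\\\u:

\\\\\\\\u05[dDeE][0-9a-fA-F]{2,}

Finally, if you want to match a run of those escape sequences as they literally appear in your target string, wrap the expression in a non-capturing group so that {2,} repeats the whole escape sequence instead of only the last hex digit: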

(?:\\\\\\\\u05[dDeE][0-9a-fA-F]){2,}

Edit: Since there is only one backslash in your target string, your regular expression should be:

(?:\\\\u05[dDeE][0-9a-fA-F]){2,}

This will match \u05db\u05d3\u05d5\u05e8\u05d2\u05dc in your string:

<\/span><\/span><span dir=\"rtl\">\n \u05db\u05d3\u05d5\u05e8\u05d2\u05dc\n <\/span>\n<br style=\"clear : both; font-size : 1px;\">\n<\/div>"}, 200, null, null);
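A minimal sketch of that last expression in use, assuming the input really contains the escape sequences as plain text (a backslash, the letter u, and four hex digits). The class EscapedTextMatchDemo, the constant ESCAPED_RUN and the method firstRun are names invented for this example, and the sample text is a shortened version of the string above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class EscapedTextMatchDemo {

    // Four backslashes in the Java literal become two in the regex,
    // which match one literal backslash in the input.
    static final Pattern ESCAPED_RUN =
            Pattern.compile("(?:\\\\u05[dDeE][0-9a-fA-F]){2,}");

    // Returns the first run of two or more escape sequences, or null if there is none.
    static String firstRun(String text) {
        Matcher m = ESCAPED_RUN.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Each doubled backslash in this literal is a single backslash at run time.
        String text = "\\n     \\u05db\\u05d3\\u05d5\\u05e8\\u05d2\\u05dc\\n    <\\/span>";
        System.out.println(firstRun(text));
    }
}
```

The non-capturing group is what lets {2,} count whole escape sequences; without it the quantifier would apply only to the final hex digit.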

Edit 2: If you want to match the literal text \u05db\u05d3\u05d5\u05e8\u05d2\u05dc, then you can't use a range.

On the other hand, if you want to match actual Unicode code points between 05d0 and 05df, then you can use a character class with a hyphen between the two endpoints:

(?:[\\u05d0-\\u05df]){2,}
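For the code point case, here is a small sketch, assuming the text has already been decoded so that it contains real Hebrew characters. It uses the wider range 05d0 to 05ea from Edit3 of the question, since the sample word contains U+05E8, which lies outside 05d0 to 05df. The class HebrewRangeDemo, the constant HEBREW_RUN and the method firstRun are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class HebrewRangeDemo {

    // The doubled backslashes reach the regex engine as single ones,
    // and the engine itself resolves the two four-digit escapes.
    static final Pattern HEBREW_RUN = Pattern.compile("[\\u05d0-\\u05ea]{2,}");

    // Returns the first run of two or more characters in the range, or null if there is none.
    static String firstRun(String text) {
        Matcher m = HEBREW_RUN.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Single-backslash escapes here are resolved by the Java compiler into real characters.
        String text = "<span dir=\"rtl\">\n \u05db\u05d3\u05d5\u05e8\u05d2\u05dc\n </span>";
        String run = firstRun(text);
        System.out.println(run == null ? "no match" : run.length() + " characters matched");
    }
}
```

Because the character class needs no grouping, the {2,} quantifier can be applied to it directly.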
answered Sep 20 '22 17:09 by Ibrahim Najjar