
regular expression containing unicode words

I'd like to match all strings containing a certain word, like this:

String regex = "(?:\\P{L}|\\W|^)(ベスパ)(?:\\b|$)";

However, the Pattern class fails to compile it:

java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 39
(?:\P{L}|\W|^)((?:ベス|ベス|ヘズ)(?:パ)|パ)|ハ)゚)(?:\b|$)

I already pass UNICODE_CASE in the flags given to Pattern.compile, so I'm not sure what's going wrong here:

final Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.CANON_EQ);

Thanks for help! :)

Frost Avatar asked Apr 12 '11 21:04



1 Answer

From the error message given, which looks nothing at all like the String regex shown, I infer that the original pattern was essentially as follows, which I have taken the liberty to reformat, add symbolic constants to, and preface with line numbers that we might inspect and address it more easily.

(All non-trivial patterns should always be written in (?x) mode — even though Java fights against you here, you should still do it.)

  1     (?: \P{L} | \W | ^ )
  2     (
  3         (?: \N{KATAKANA LETTER BE} \N{KATAKANA LETTER SU}
  4           | \N{KATAKANA LETTER BE} \N{KATAKANA LETTER SU}
  5           | \N{KATAKANA LETTER HE} \N{KATAKANA LETTER ZU}
  6         )
  7         (?: \N{KATAKANA LETTER PA} )
  8     |
  9             \N{KATAKANA LETTER PA}
 10     )
 11 |
 12             \N{KATAKANA LETTER HA}
 13     )
 14     \N{COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK}
 15     )
 16     (?: \b | $ )

The first and last lines are wrong, but they are wrong in a semantic way related to Java’s broken regexes. They are not syntactically wrong.

As should now be apparent, the syntactic issue is that the close parentheses at lines 13 and 15 are spurious: they have no corresponding open parentheses.
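One way to see the parser's own view of the problem is to catch the exception and print its description and index. The following is only a sketch: the class and method names are made up for illustration, and the broken pattern is transcribed by hand from the error message in the question (written with Unicode escapes), so the reported index may differ from the 39 shown there.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper, not part of any library.
final class RegexSyntaxCheck {
    // Pattern transcribed from the error message in the question.
    // BE = U+30D9, SU = U+30B9, HE = U+30D8, ZU = U+30BA,
    // PA = U+30D1, HA = U+30CF, combining semi-voiced mark = U+309A.
    static final String BROKEN =
            "(?:\\P{L}|\\W|^)"
            + "((?:\u30D9\u30B9|\u30D9\u30B9|\u30D8\u30BA)(?:\u30D1)|\u30D1)"
            + "|\u30CF)\u309A)(?:\\b|$)";

    // The pattern as the asker wrote it, before Java rewrote it.
    static final String AS_WRITTEN = "(?:\\P{L}|\\W|^)(\u30D9\u30B9\u30D1)(?:\\b|$)";

    // Returns null when the regex compiles, otherwise the parser's complaint.
    static String describeSyntaxError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription() + " (near index " + e.getIndex() + ")";
        }
    }

    public static void main(String[] args) {
        System.out.println("broken:     " + describeSyntaxError(BROKEN));
        System.out.println("as written: " + describeSyntaxError(AS_WRITTEN));
    }
}
```

Compiled without CANON_EQ, the pattern as written is accepted, which is consistent with the extra parentheses having been introduced by the rewriting rather than by the asker.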

The first and last lines notwithstanding, I am still trying to understand what it is you are truly trying to do here. Why the duplication of lines 3 and 4? That doesn’t do anything useful. And I can see no reason for the grouping at line 7.

Is the intent to allow the combining mark to apply to any of the preceding things?

As for the errors in the first and last lines, do I understand that a simple word boundary is all that you are looking for? Do you actually mean to include those boundary characters there as part of your match, or are you just trying to establish boundaries? Why are you saying a non-letter or a non-word?

Word characters do include letters, you know; at least, according to the Unicode spec they do, even if Java gets this wrong. Alas, because of that Java regex bug, your \W has just taken in a whole bunch of letters, so we will have to recode this once I understand what you really want.
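To make the \W problem concrete, here is a small sketch (class and method names are invented for the example). By default Java's \w covers only ASCII letters, digits, and underscore, so \W happily matches a katakana letter. On Java 7 and later, the Pattern.UNICODE_CHARACTER_CLASS flag switches \w and \W to the Unicode definitions.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
final class WordClassDemo {
    // KATAKANA LETTER BE, written as an escape to keep the source ASCII-only.
    static final String KATAKANA_BE = "\u30D9";

    // Default Java semantics: \W is anything outside [a-zA-Z_0-9].
    static boolean isNonWordByDefault(String s) {
        return Pattern.compile("\\W").matcher(s).matches();
    }

    // Unicode-aware semantics (requires Java 7 or later).
    static boolean isNonWordUnicodeAware(String s) {
        return Pattern.compile("\\W", Pattern.UNICODE_CHARACTER_CLASS)
                .matcher(s).matches();
    }

    // \p{L} is Unicode-aware regardless of flags.
    static boolean isLetter(String s) {
        return Pattern.compile("\\p{L}").matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isNonWordByDefault(KATAKANA_BE));    // true
        System.out.println(isNonWordUnicodeAware(KATAKANA_BE)); // false
        System.out.println(isLetter(KATAKANA_BE));              // true
    }
}
```

So in the question's first group, \P{L} rejects the katakana letter but the \W alternative lets it straight back in.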

If only you used something that was actually compliant with UTS#18, it would work ok, but as I presume you haven’t (I heard no mention of ICU), we’ll have to fix it along the lines I have previously outlined.

A lookbehind for either a non-word or the start of string would work for the first one, and a lookahead for either a non-word or the end of string would work for the last one. That is what \b is of course supposed to do when facing word characters as you have here, and it might even work out that way provided you stay clear of your non-word particle.
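A rough sketch of that lookaround approach follows. Several things here are assumptions rather than anything the question confirms: the class and method names are invented, "word character" is taken to mean letters, marks, digits, and underscore, and NFC normalization via java.text.Normalizer is swapped in for CANON_EQ so that decomposed and precomposed kana compare equal. "Not preceded by a word character" is used as shorthand for "preceded by a non-word or by the start of the string", and likewise at the end.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class UnicodeWordBoundary {
    // Assumed definition of a word character: letter, mark, digit, underscore.
    private static final String WORD_CHAR = "[\\p{L}\\p{M}\\p{N}_]";

    // Builds a pattern that matches the word only when it is not glued
    // to another word character on either side. The lookarounds consume
    // nothing, so the boundary characters are not part of the match.
    static Pattern wholeWord(String word) {
        String literal = Pattern.quote(Normalizer.normalize(word, Normalizer.Form.NFC));
        return Pattern.compile(
                "(?<!" + WORD_CHAR + ")" + literal + "(?!" + WORD_CHAR + ")",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    // Normalizes the text to NFC first (an assumption standing in for CANON_EQ).
    static boolean containsWord(String text, String word) {
        String normalized = Normalizer.normalize(text, Normalizer.Form.NFC);
        return wholeWord(word).matcher(normalized).find();
    }

    public static void main(String[] args) {
        String vespa = "\u30D9\u30B9\u30D1"; // BE SU PA
        // Surrounded by spaces: matches.
        System.out.println(containsWord("x " + vespa + " y", vespa));
        // Followed directly by more katakana (KU RA BU): no match.
        System.out.println(containsWord(vespa + "\u30AF\u30E9\u30D6", vespa));
        // Decomposed spelling (HE + voiced mark, SU, HA + semi-voiced mark): matches.
        System.out.println(containsWord("\u30D8\u3099\u30B9\u30CF\u309A", vespa));
    }
}
```

Note that running Japanese text rarely separates words with spaces, so whether this boundary rule is the right one depends on the original intent.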

But until I can see more of the original intent, I don’t think I should say more.

tchrist Avatar answered Oct 23 '22 08:10
