
Matching Unicode Dashes in Java Regular Expressions?

I'm trying to craft a Java regular expression to split strings of the general format "foo - bar" into "foo" and "bar" using Pattern.split(). The "-" character may be one of several dashes: the ASCII '-', the em-dash, the en-dash, etc. I've constructed the following regular expression:

private static final Pattern titleSegmentSeparator = Pattern.compile("\\s(\\x45|\\u8211|\\u8212|\\u8213|\\u8214)\\s");

which, if I'm reading the Pattern documentation correctly, should match any of the Unicode dashes or the ASCII dash when surrounded on both sides by whitespace. I'm using the pattern as follows:

String[] sectionSegments = titleSegmentSeparator.split(sectionTitle);

No joy. For the sample input below, the dash is not detected, and titleSegmentSeparator.matcher(sectionTitle).find() returns false!

To make sure I wasn't missing any unusual character entities, I used System.out to print some debug information. The output is as follows -- each character is followed by the output of (int)char, which should be its Unicode code point, no?

Sample input:

Study Summary (1 of 10) – Competition

S(83)t(116)u(117)d(100)y(121) (32)S(83)u(117)m(109)m(109)a(97)r(114)y(121) (32)((40)1(49) (32)o(111)f(102) (32)1(49)0(48))(41) (32)–(8211) (32)C(67)o(111)m(109)p(112)e(101)t(116)i(105)t(116)i(105)o(111)n(110)

It looks to me like that dash is codepoint 8211, which should be matched by the regex, but it isn't! What's going on here?

Alterscape asked Jun 15 '10 13:06


1 Answer

You're mixing decimal (8211) and hexadecimal (0x8211).

\x and \u both expect a hexadecimal number, so you need \u2013 to match the en-dash (decimal 8211) and \u2014 to match the em-dash (decimal 8212), not \u8211 or \u8212. Likewise, use \x2D for the normal hyphen (decimal 45); \x45 is the letter 'E'.
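As a rough sketch of that fix, here is the pattern from the question with each decimal value converted to hex. The class name HexDashSplit and the helper splitTitle are made up for illustration. Note that decimal 8214 from the original pattern is U+2016 (DOUBLE VERTICAL LINE), which is not really a dash; it is kept here only to mirror the question.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical class name, for illustration only.
class HexDashSplit {
    // Decimal code points from the question, rewritten as hex escapes:
    //   45   -> x2D    hyphen-minus
    //   8211 -> U+2013 en dash
    //   8212 -> U+2014 em dash
    //   8213 -> U+2015 horizontal bar
    //   8214 -> U+2016 double vertical line (not a dash, kept from the question)
    static final Pattern TITLE_SEGMENT_SEPARATOR =
            Pattern.compile("\\s(\\x2D|\\u2013|\\u2014|\\u2015|\\u2016)\\s");

    // Hypothetical helper: splits a title on a whitespace-delimited dash.
    static String[] splitTitle(String title) {
        return TITLE_SEGMENT_SEPARATOR.split(title);
    }

    public static void main(String[] args) {
        // Quick way to get the hex form of a decimal code point.
        System.out.println(Integer.toHexString(8211)); // prints 2013

        String sample = "Study Summary (1 of 10) \u2013 Competition";
        // prints [Study Summary (1 of 10), Competition]
        System.out.println(Arrays.toString(splitTitle(sample)));
    }
}
```

Integer.toHexString is handy whenever debug output gives you (int)char values and you need the matching \u escape.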

But why not simply use the Unicode property "Dash punctuation"?

As a Java string: "\\s\\p{Pd}\\s"
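A minimal sketch of the property-based version, assuming the same "foo - bar" inputs as in the question. The class name DashPropertySplit and the helper splitTitle are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical class name, for illustration only.
class DashPropertySplit {
    // \p{Pd} matches any character in the Unicode general category
    // "Dash punctuation", which includes hyphen-minus, en dash and em dash.
    static final Pattern SEPARATOR = Pattern.compile("\\s\\p{Pd}\\s");

    // Hypothetical helper: splits a title on a whitespace-delimited dash.
    static String[] splitTitle(String title) {
        return SEPARATOR.split(title);
    }

    public static void main(String[] args) {
        String[] samples = {
            "foo - bar",       // ASCII hyphen-minus
            "foo \u2013 bar",  // en dash
            "foo \u2014 bar"   // em dash
        };
        for (String s : samples) {
            // each line prints [foo, bar]
            System.out.println(Arrays.toString(splitTitle(s)));
        }
    }
}
```

One thing to watch: by default \s in Java only matches ASCII whitespace, so a dash surrounded by non-breaking spaces would not be split by either pattern.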

Tim Pietzcker answered Sep 22 '22 09:09