
Why Unicode Character 'MINUS SIGN' (U+2212) is NOT in regex unicode group \p{Pd} (Dash_Punctuation)?

Tags: regex, unicode

I'm trying to collect all dash characters so that I can use them while analyzing raw text data. I found that the Unicode regex class \p{Pd} should match all of them, but it turned out that U+2212 MINUS SIGN doesn't match!

Here is more info about this char: https://www.fileformat.info/info/unicode/char/2212/index.htm

Is this a bug or a feature? In practice, this behavior isn't very useful.

Siarhei asked Dec 28 '25 14:12


1 Answer

The Unicode character U+2212 MINUS SIGN is a math symbol rather than a punctuation mark: its General_Category is Sm (Math_Symbol), not Pd. That is why it is matched by \p{Math} but not by \p{Punctuation} (which includes \p{Dash_Punctuation}).
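To see this for a given character, a small check like the following may help. It is only a sketch: `matchingProperties` is a hypothetical helper name, and it assumes a JavaScript engine that supports Unicode property escapes with the `u` flag.

```javascript
// U+2212 MINUS SIGN, written as an escape so it cannot be confused with U+002D.
const minusSign = "\u2212";

// Hypothetical helper: list which property escapes match a single character.
function matchingProperties(ch) {
  const candidates = {
    "\\p{Pd}": /^\p{Pd}$/u,                   // General_Category = Dash_Punctuation
    "\\p{Punctuation}": /^\p{Punctuation}$/u, // any punctuation category
    "\\p{Sm}": /^\p{Sm}$/u,                   // General_Category = Math_Symbol
    "\\p{Math}": /^\p{Math}$/u,               // binary property Math
    "\\p{Dash}": /^\p{Dash}$/u,               // binary property Dash
  };
  return Object.keys(candidates).filter((name) => candidates[name].test(ch));
}

console.log(matchingProperties(minusSign)); // [ '\\p{Sm}', '\\p{Math}', '\\p{Dash}' ]
console.log(matchingProperties("-"));       // U+002D HYPHEN-MINUS, for comparison
```

Note that \p{Pd} tests a General_Category value, while \p{Dash} tests a separate binary property, so a character can have one without the other.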

You may want to try \p{Dash} instead and check whether it covers all your needs.
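For the text-analysis use case in the question, one possible way to apply \p{Dash} is to fold every dash-like character into a plain hyphen-minus before further processing. This is a sketch only: `normalizeDashes` and the sample string are made up for illustration.

```javascript
// Hypothetical helper: replace every character that has the Dash property
// with U+002D HYPHEN-MINUS.
function normalizeDashes(text) {
  return text.replace(/\p{Dash}/gu, "-");
}

// Sample input containing U+2212 MINUS SIGN, U+2013 EN DASH and U+2010 HYPHEN.
const sample = "10 \u2212 3 = 7, pages 4\u20139, well\u2010known";
console.log(normalizeDashes(sample)); // "10 - 3 = 7, pages 4-9, well-known"
```

Whether folding a minus sign into a hyphen is acceptable depends on the data; for mathematical text it may be better to keep them distinct.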

Ref: Properties for U+2212

Edit:

Here is an "official" list of all the characters that have the Dash Unicode property, which includes U+2212 MINUS SIGN: https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Dash=Yes:]

As of Unicode 12.0, the JavaScript regular expression:

/\p{Dash}/u

would be equivalent to:

/[\u002D\u058A\u05BE\u1400\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]/
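The equivalence can be checked against whatever Unicode version your JavaScript engine ships with. The sketch below walks every code point and compares the two patterns; the variable names are made up for illustration, and on engines with Unicode data newer than 12.0 the first list may contain characters that were given the Dash property later.

```javascript
// Explicit character class quoted above (Dash=Yes as of Unicode 12.0).
const dashUnicode12 = /[\u002D\u058A\u05BE\u1400\u1806\u2010\u2011\u2012\u2013\u2014\u2015\u2053\u207B\u208B\u2212\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]/;
// Property escape, resolved against the engine's own Unicode data.
const dashProperty = /\p{Dash}/u;

// Format a code point as U+XXXX.
const label = (cp) => "U+" + cp.toString(16).toUpperCase().padStart(4, "0");

const onlyInProperty = []; // matched by \p{Dash} but not by the explicit class
const onlyInClass = [];    // matched by the explicit class but not by \p{Dash}

for (let cp = 0; cp <= 0x10ffff; cp++) {
  if (cp >= 0xd800 && cp <= 0xdfff) continue; // skip lone surrogates
  const ch = String.fromCodePoint(cp);
  const inClass = dashUnicode12.test(ch);
  const inProperty = dashProperty.test(ch);
  if (inProperty && !inClass) onlyInProperty.push(label(cp));
  if (inClass && !inProperty) onlyInClass.push(label(cp));
}

console.log("Only in \\p{Dash}:", onlyInProperty);
console.log("Only in the explicit class:", onlyInClass);
```

If both lists come back empty, the two patterns agree for that engine; a non-empty first list suggests the engine uses a later Unicode version than 12.0.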

Jigorodake answered Dec 30 '25 12:12


