I'm trying to build a regexp in Ruby to match alphabetic characters in UTF-8 such as ñ, í, ó, ú, ü, etc. I know that /\p{Alpha}/i works and that /\p{L}/i works too, but what's the difference?
They seem to be equivalent. (Edit: sometimes, see the end of this answer)
It seems like Ruby supports \p{Alpha} since version 1.9. In POSIX, \p{Alpha} is equal to \p{L&} (for regular expressions with Unicode support; see here). This matches all characters that have an upper- and lower-case variant (see here). Unicase letters would not be matched (while they would be matched by \p{L}).
This does not seem to be true for Ruby (I picked a random Arabic character, since Arabic has a unicase alphabet):
\p{L} (any letter) matches.
\p{Lu}, \p{Ll}, \p{Lt} don't match. As expected.
\p{L&} doesn't match. As expected.
\p{Alpha} matches.
Which seems to be a very good indication that \p{Alpha} is just an alias for \p{L} in Ruby. On Rubular you can also see that \p{Alpha} was not available in Ruby 1.8.7.
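The checks above can be reproduced with a short script. This is a minimal sketch, assuming ARABIC LETTER AIN (U+0639) as the sample character (the character used in the original test is not stated) and spelling the cased-letter class out as a union of \p{Lu}, \p{Ll} and \p{Lt} instead of relying on \p{L&}. The constant names are hypothetical. It tests a single character, so it illustrates the observation rather than proving that the two classes agree on every code point.

```ruby
# Assumed sample: ARABIC LETTER AIN (U+0639), a letter without case variants.
SAMPLE = "\u0639"

# The u flag pins each pattern to UTF-8 so the Unicode properties are
# available even if the source file is not read as UTF-8.
CHECKS = {
  '\p{L}'              => /\p{L}/u,
  '\p{Lu}'             => /\p{Lu}/u,
  '\p{Ll}'             => /\p{Ll}/u,
  '\p{Lt}'             => /\p{Lt}/u,
  'cased (Lu, Ll, Lt)' => /[\p{Lu}\p{Ll}\p{Lt}]/u, # stand-in for \p{L&}
  '\p{Alpha}'          => /\p{Alpha}/u
}

# Map each label to true/false depending on whether the sample matches.
RESULTS = CHECKS.transform_values { |re| re.match?(SAMPLE) }

RESULTS.each do |label, matched|
  puts "#{label}: #{matched ? 'matches' : 'no match'}"
end
```

Swapping SAMPLE for another character is the quickest way to probe where the two classes might diverge.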
Note that the i modifier is irrelevant in any case, because both \p{Alpha} and \p{L} match both upper- and lower-case characters anyway.
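To see that the i modifier makes no difference, one can compare the results with and without it. A small sketch, with the sample letters chosen purely for illustration and the constant names hypothetical:

```ruby
# Lower- and upper-case samples: n N ñ Ñ ü Ü (illustrative choices).
LETTERS = ["n", "N", "\u00F1", "\u00D1", "\u00FC", "\u00DC"]

# For each letter, record whether \p{L} and \p{Alpha} match it,
# first without and then with the case-insensitive modifier.
WITHOUT_I = LETTERS.map { |ch| [/\p{L}/u.match?(ch), /\p{Alpha}/u.match?(ch)] }
WITH_I    = LETTERS.map { |ch| [/\p{L}/ui.match?(ch), /\p{Alpha}/ui.match?(ch)] }

puts "same results with and without i: #{WITHOUT_I == WITH_I}"
```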
EDIT:
Aha, there is a difference! I just found this PDF about Ruby's new regex engine (in use as of Ruby 1.9, as stated above). \p{Alpha} is available regardless of encoding (and will probably just match [A-Za-z] if there is no Unicode support), while \p{L} is specifically a Unicode property. That means \p{Alpha} behaves exactly as in POSIX regexes, with the difference that here it corresponds to \p{L}, but in POSIX it corresponds to \p{L&}.
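One way to probe the encoding dependence is to compile each pattern from a source string tagged with a particular encoding. The following is only a sketch: compile_in is a hypothetical helper, US-ASCII is an assumed stand-in for "no Unicode support", and the outcome for \p{L} under US-ASCII may vary by Ruby version, so the script reports it rather than assuming it.

```ruby
# Hypothetical helper: compile a pattern whose source string is tagged with
# the given encoding. Returns the Regexp, or the RegexpError raised while
# compiling it.
def compile_in(source, encoding_name)
  Regexp.new(source.dup.force_encoding(encoding_name))
rescue RegexpError => e
  e
end

ALPHA_ASCII = compile_in('\p{Alpha}', 'US-ASCII')
L_ASCII     = compile_in('\p{L}', 'US-ASCII') # may be a Regexp or an error
L_UTF8      = compile_in('\p{L}', 'UTF-8')

{ '\p{Alpha} in US-ASCII' => ALPHA_ASCII,
  '\p{L} in US-ASCII'     => L_ASCII,
  '\p{L} in UTF-8'        => L_UTF8 }.each do |label, result|
  status = result.is_a?(Regexp) ? 'compiles' : "fails (#{result.message})"
  puts "#{label}: #{status}"
end
```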