According to the Oniguruma documentation, the \d character type matches:
decimal digit char
Unicode: General_Category -- Decimal_Number
However, scanning for \d in a string containing all the Decimal_Number characters matches only the Latin digits 0-9:
#encoding: utf-8
require 'open-uri'
html = open("http://www.fileformat.info/info/unicode/category/Nd/list.htm").read
digits = html.scan(/U\+([\da-f]{4})/i).flatten.map{ |s| s.to_i(16) }.pack('U*')
puts digits.encoding, digits
#=> UTF-8
#=> 0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…
p RUBY_DESCRIPTION, digits.scan(/\d/)
#=> "ruby 1.9.2p180 (2011-02-18) [i386-mingw32]"
#=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
Am I misreading the documentation? Why doesn't \d match other Unicode numerals, and is there a way to make it do so?
Noted by Brian Candler on ruby-talk:
\w only matches ASCII letters and digits, while [[:alpha:]] matches the full set of Unicode letters.
\d only matches ASCII digits, while [[:digit:]] matches the full set of Unicode numbers.
The behavior is thus 'consistent', and we have a simple workaround for Unicode numbers. Reading up on \w in the same Oniguruma doc we see the text:
\w word character
Not Unicode: alphanumeric, "_" and multibyte char.
Unicode: General_Category -- (Letter|Mark|Number|Connector_Punctuation)
In light of the real behavior of Ruby and the "Not Unicode" text above, it would appear that the documentation is describing two modes—a Unicode mode and a Not Unicode mode—and that Ruby is operating in the Not Unicode mode.
This would explain why \d does not match the full Unicode set: although the Oniguruma documentation fails to describe exactly what is matched when in Not Unicode mode, we now know that the behavior documented as "Unicode" is not to be expected.
p "abç".scan(/\w/), "abç".scan(/[[:alpha:]]/)
#=> ["a", "b"]
#=> ["a", "b", "\u00E7"]
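The same contrast can be checked for digits. This is a minimal sketch, assuming a Ruby 1.9 or later interpreter; the sample string and variable names are made up for illustration.

```ruby
# Sample string: ASCII "1", ARABIC-INDIC DIGIT THREE (U+0663),
# DEVANAGARI DIGIT THREE (U+0969), and a non-digit "x".
sample = "1\u0663\u0969x"

ascii_only = sample.scan(/\d/)           # \d is ASCII-only by default
all_digits = sample.scan(/[[:digit:]]/)  # the POSIX bracket is Unicode-aware

p ascii_only       #=> ["1"]
p all_digits.size  #=> 3
```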
It is left as an exercise to the reader to discover how (if at all) to enable Unicode mode in Ruby regexps, as the /u flag (e.g. /\w/u) does not do it. (Perhaps Ruby must be recompiled with a special flag for Oniguruma.)
Update: It would appear that the Oniguruma document I have linked to is not accurate for Ruby 1.9. See this ticket discussion, including these posts:
[Yui NARUSE] "RE.txt is for original Oniguruma, not for Ruby 1.9's regexp. We may need our own document."
[Matz] "Our Oniguruma is forked one. The original Oniguruma found in geocities.jp has not been changed."
Better Reference: Here is official documentation on Ruby 1.9's regexp syntax:
https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc
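Try the Unicode character class \p{N} instead. That matches every character in the Unicode Number category, which covers all decimal digits but also other numeric characters such as fractions and Roman numerals; use \p{Nd} if you want decimal digits only. No idea why \d isn't working.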
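A small sketch of how \p{N} compares with the narrower \p{Nd} property, assuming Ruby 1.9 or later; the sample characters are chosen only for illustration.

```ruby
# "7" (Nd), U+0663 ARABIC-INDIC DIGIT THREE (Nd),
# U+00BD VULGAR FRACTION ONE HALF (No), U+2167 ROMAN NUMERAL EIGHT (Nl).
sample = "7\u0663\u00BD\u2167"

any_number   = sample.scan(/\p{N}/)   # whole Number category (Nd, Nl, No)
decimal_only = sample.scan(/\p{Nd}/)  # Decimal_Number only

p any_number.size    #=> 4
p decimal_only.size  #=> 2
```

So \p{N} is the broader choice, and \p{Nd} is the one that corresponds to "decimal digit" in the sense the question asks about.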
\d will only match ASCII digits by default. You can manually turn on Unicode matching in a regex using the (counter-intuitive) (?u) syntax:
"𝟛".match(/(?u)\d/) # => #<MatchData "𝟛">
Alternatively, you can use the POSIX bracket or Unicode property style in your regex, neither of which requires you to manually turn on Unicode matching:
/[[:digit:]]/ # posix style
/\p{Nd}/ # unicode property/category style
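As a quick check that the three forms behave alike, the sketch below runs each against the same sample. It assumes a Ruby version that supports the (?u) option (2.0 or later, as far as I know); the sample string and variable names are hypothetical.

```ruby
# ASCII "7", ARABIC-INDIC DIGIT THREE (U+0663),
# DEVANAGARI DIGIT THREE (U+0969), and a non-digit "x".
sample = "7\u0663\u0969x"

patterns = [/(?u)\d/, /[[:digit:]]/, /\p{Nd}/]
matches  = patterns.map { |re| sample.scan(re) }

# Print how many characters each pattern found.
patterns.zip(matches).each do |re, found|
  puts "#{re.source}: #{found.size} matches"
end

all_agree = matches.uniq.size == 1
p all_agree  #=> true
```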
You can find more detailed information about how to do advanced matching for Unicode characters in Ruby in this blog post: https://idiosyncratic-ruby.com/30-regex-with-class.html