I'm working with Ruby's regex engine. I need to write a regex that does this:
WIKI_WORD = /\b([a-z][\w_]+\.)?[A-Z][a-z]+[A-Z]\w*\b/
but will also work in other European languages besides English. I don't think that the character range [a-z] will cover lowercase letters in German, etc.
WIKI_WORD = /\b(\p{Ll}\w+\.)?\p{Lu}\p{Ll}+\p{Lu}\w*\b/u
should work in Ruby 1.9. \p{Lu} and \p{Ll} are shorthands for uppercase and lowercase Unicode letters. (\w already includes the underscore.)
See also this answer - you might need to run Ruby in UTF-8 mode for this to work, and possibly your script must be encoded in UTF-8, too.
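To make the difference concrete, here is a minimal sketch comparing the two patterns on a German-looking WikiWord. The constant names ASCII_WIKI_WORD and UNICODE_WIKI_WORD and the sample word are made up for illustration, and the script assumes it is saved as UTF-8:

```ruby
# encoding: UTF-8

# Original pattern from the question: ASCII letters only.
ASCII_WIKI_WORD   = /\b([a-z][\w_]+\.)?[A-Z][a-z]+[A-Z]\w*\b/
# Pattern from this answer: Unicode upper/lowercase properties.
UNICODE_WIKI_WORD = /\b(\p{Ll}\w+\.)?\p{Lu}\p{Ll}+\p{Lu}\w*\b/u

# Plain English WikiWords match either way.
english_ascii   = "WikiWord" =~ ASCII_WIKI_WORD     # => 0
english_unicode = "WikiWord" =~ UNICODE_WIKI_WORD   # => 0

# "GrößeNord" is a made-up sample: the ö and ß fall outside [a-z].
german_ascii   = "GrößeNord" =~ ASCII_WIKI_WORD     # => nil
german_unicode = "GrößeNord" =~ UNICODE_WIKI_WORD   # => 0
```

Note that the sample keeps its non-ASCII letters between the two capitals, where \p{Ll} does the matching. Letters such as ö after the second capital are handled by \w, which is a separate question.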
James Edward Gray II wrote a series of articles on working with Unicode, UTF-8 and Ruby 1.8.7 and 1.9.2. They're important reading.
With Ruby 1.8.7, we could add:
#!/usr/bin/ruby -Ku
require 'jcode'
and get partial UTF-8 support.
With 1.9.2 you can use:
# encoding: UTF-8
as the first line of your source file (or the second line, if the first is a shebang) and that will tell Ruby to treat the source as UTF-8. Gray's recommendation is that we do that with all source we write from now on.
That will not affect external encoding when reading/writing text, only the encoding of the source code.
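As a rough illustration of that split, the sketch below sets the source encoding with the magic comment and then names the external encoding explicitly when writing and reading a file. The temporary file and its contents are hypothetical stand-ins for real input:

```ruby
# encoding: UTF-8
require 'tempfile'

# The magic comment only governs literals in this source file.
source_encoding = __ENCODING__

# Hypothetical sample file, standing in for real external data.
sample = Tempfile.new('wiki_sample')
sample.close

# The "w:UTF-8" and "r:UTF-8" mode strings set the external encoding
# per file handle, independent of the source encoding above.
File.open(sample.path, 'w:UTF-8') { |io| io.write("GrößeNord\n") }
text = File.open(sample.path, 'r:UTF-8') { |io| io.read }

sample.unlink
```

Naming the encoding in the mode string avoids depending on whatever default external encoding Ruby derives from the environment.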
Ruby 1.9.2 doesn't extend the usual \w, \W and \s character classes to handle UTF-8 or Unicode; they match ASCII characters only. As the other comments and answers said, only the POSIX bracket expressions (such as [[:alpha:]]) and the Unicode properties (such as \p{L}) in regex do that.
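A short sketch of what this means in practice, assuming a UTF-8 source file. UNICODE_WORD_WIKI is a hypothetical variant of the WikiWord pattern that swaps \w for the [[:word:]] bracket expression. It is one possible adjustment, not something the answers above prescribe:

```ruby
# encoding: UTF-8

# \w is ASCII-only, so it does not see "ö" as a word character.
ascii_word    = "ö" =~ /\w/            # => nil
# The POSIX bracket and Unicode property forms do.
posix_alpha   = "ö" =~ /[[:alpha:]]/   # => 0
unicode_prop  = "ö" =~ /\p{L}/         # => 0

# Hypothetical variant: [[:word:]] in place of \w, so non-ASCII
# letters after the second capital are still part of the match.
UNICODE_WORD_WIKI = /\b(\p{Ll}[[:word:]]+\.)?\p{Lu}\p{Ll}+\p{Lu}[[:word:]]*\b/

match = UNICODE_WORD_WIKI.match("GrößeNördlich")   # made-up sample word
whole_word = match && match[0]                     # => "GrößeNördlich"
```

With plain \w in the tail of the pattern, the match could not extend past the N in this sample, because ö is not an ASCII word character.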