Why does this code (that contains an umlaut):
text = "Some super text with a german umlaut Wirtschaftsprüfer"
words = text.split(/\W+/)
words.each do |w|
  puts w
end
return this result (which does not retain the umlaut):
=> Some
=> super
=> text
=> with
=> a
=> german
=> umlaut
=> Wirtschaftspr
=> fer
Is there a way I can retain an umlaut when splitting a string in Ruby 1.9+?
EDIT: I use ruby 1.9.3p286 (2012-10-12 revision 37165) [x86_64-darwin11.4.2]
\W matches only non-word characters, and Ruby's \w is ASCII-only, so \W is equivalent to [^a-zA-Z0-9_]. The "ü" is therefore treated as a non-word character, and split uses it as a separator. Instead, you can use
words = text.split(/[^[:word:]]/)
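which splits on any character that is not a Unicode "word" character, or
words = text.split(/[^\p{Latin}]/)
which splits on any character that is not in the Unicode Latin script (digits and underscores are not Latin-script characters, so they also act as separators).
Note that both of these treat accented and special letters from other languages as part of a word, not just German ones.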
See http://www.ruby-doc.org/core-1.9.3/Regexp.html and look for (1) "Character Classes" and (2) "Character Properties."
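To make the difference between the two patterns concrete, here is a minimal sketch. The helper names unicode_words and latin_words are made up for illustration, and a trailing + has been added to each pattern (an assumption, mirroring the /\W+/ in the question) so that runs of separators do not produce empty strings.

```ruby
# encoding: utf-8

# Hypothetical helper: split on runs of characters that are not
# Unicode word characters (letters, marks, digits, underscore).
def unicode_words(text)
  text.split(/[^[:word:]]+/)
end

# Hypothetical helper: split on runs of characters outside the Latin script.
# Digits and underscores are not Latin-script characters, so they also split.
def latin_words(text)
  text.split(/[^\p{Latin}]+/)
end

text = "Some super text with a german umlaut Wirtschaftsprüfer"

puts text.split(/\W+/).last            # ASCII-only \W cuts the word at "ü"
puts unicode_words(text).last          # keeps "Wirtschaftsprüfer" intact
puts latin_words("Prüfer2go").inspect  # the digit acts as a separator
```

If the text may contain digits inside words, the [[:word:]] variant is probably the safer choice of the two.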
You could replace /\W+/ with /\s+/ (\s matches whitespace characters: spaces, tabs, and newlines). Note that punctuation then stays attached to the neighbouring word.
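A short sketch of the whitespace approach, showing the trade-off. The helper name whitespace_words and the sample sentence are invented for illustration.

```ruby
# encoding: utf-8

# Hypothetical helper: split on runs of whitespace only.
# Umlauts survive, but punctuation is not stripped from the words.
def whitespace_words(text)
  text.split(/\s+/)
end

words = whitespace_words("Der Wirtschaftsprüfer, bitte!")
puts words.inspect
# => ["Der", "Wirtschaftsprüfer,", "bitte!"]
```

This is fine when the input is already free of punctuation, as in the question's example; otherwise a Unicode-aware word pattern is likely a better fit.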