Retain Umlaut Character when using Split in Ruby

Question

Why does this code (that contains an umlaut):

text = "Some super text with a german umlaut Wirtschaftsprüfer"
words = text.split(/\W+/)
words.each do |w|
  puts w
end

return this result (which loses the umlaut and splits the word in two)?

=> Some
=> super
=> text
=> with
=> a
=> german
=> umlaut
=> Wirtschaftspr
=> fer

Is there a way I can retain an umlaut when splitting a string in Ruby 1.9+?

EDIT: I use ruby 1.9.3p286 (2012-10-12 revision 37165) [x86_64-darwin11.4.2]

Reinstate Monica -- notmaynard · Accepted Answer

\W matches non-word characters in the ASCII sense only: in Ruby 1.9 it is equivalent to [^a-zA-Z0-9_], so letters with diacritics and other non-ASCII characters such as ü count as non-word characters and act as separators. You can use

words = text.split(/[^[:word:]]/)

which splits on anything that is not a Unicode "word" character, or

words = text.split(/[^\p{Latin}]/)

which splits on anything that is not a character in the Unicode Latin script.
Note that both character classes also cover accented and special letters from other languages, not just German.

See http://www.ruby-doc.org/core-1.9.3/Regexp.html and look for (1) "Character Classes" and (2) "Character Properties."
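
As a rough sketch of how the three patterns compare, the snippet below runs them against the string from the question. The trailing + on each pattern and the sample string "Prüfer2012Bericht" are additions for illustration and are not part of the answer above; the + only collapses runs of separators so that no empty strings appear.

```ruby
# encoding: utf-8
# (The magic comment above matters on Ruby 1.9 for non-ASCII literals;
# Ruby 2.0+ assumes UTF-8 source by default.)

text = "Some super text with a german umlaut Wirtschaftsprüfer"

# \W is ASCII-only, so "ü" counts as a separator and the word is cut in two.
ascii_words = text.split(/\W+/)

# [[:word:]] is Unicode-aware, so "ü" stays inside the word.
unicode_words = text.split(/[^[:word:]]+/)

# \p{Latin} covers Latin-script letters only, so digits act as separators too.
latin_words = text.split(/[^\p{Latin}]+/)
latin_with_digits = "Prüfer2012Bericht".split(/[^\p{Latin}]+/)

puts ascii_words.last(2).inspect   # ["Wirtschaftspr", "fer"]
puts unicode_words.last            # Wirtschaftsprüfer
puts latin_words.last              # Wirtschaftsprüfer
puts latin_with_digits.inspect     # ["Prüfer", "Bericht"]
```

If the text may contain digits or underscores that should stay inside words, the [[:word:]] variant is probably the safer choice of the two.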

Baldrick · Answer

You could replace /\W+/ with /\s+/ (\s matches whitespace characters: spaces, tabs, newlines). Note that this leaves punctuation attached to the neighbouring word.
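
A small sketch of the trade-off, using a made-up sentence (not from the question) that contains punctuation. It compares whitespace splitting with the [[:word:]] pattern from the accepted answer, again with a + appended for illustration.

```ruby
# encoding: utf-8
sentence = "Der Wirtschaftsprüfer kommt, sagt er."

# Splitting on whitespace keeps every non-space character, punctuation included.
by_space = sentence.split(/\s+/)

# Splitting on non-word characters drops the punctuation as well.
by_word = sentence.split(/[^[:word:]]+/)

puts by_space.inspect  # ["Der", "Wirtschaftsprüfer", "kommt,", "sagt", "er."]
puts by_word.inspect   # ["Der", "Wirtschaftsprüfer", "kommt", "sagt", "er"]
```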

Tags: split, ruby

Matthias
