
Retain Umlaut Character when using Split in Ruby

Tags:

split

ruby

Why does this code (that contains an umlaut):

text = "Some super text with a german umlaut Wirtschaftsprüfer"
words = text.split(/\W+/)
words.each do |w|
  puts w
end

return this result (which loses the umlaut and breaks the word in two):

=> Some
=> super
=> text
=> with
=> a
=> german
=> umlaut
=> Wirtschaftspr
=> fer

Is there a way I can retain an umlaut when splitting a string in Ruby 1.9+?

EDIT: I use ruby 1.9.3p286 (2012-10-12 revision 37165) [x86_64-darwin11.4.2]

Matthias asked Jun 05 '26 09:06


2 Answers

\W matches non-word characters only in the ASCII sense: in Ruby it is equivalent to [^a-zA-Z0-9_]. Diacritics and other non-ASCII letters such as ü therefore count as non-word characters, and the split breaks the string at them. You can use

words = text.split(/[^[:word:]]/)

which splits on anything that is not a Unicode "word" character (letters, marks, digits, and connector punctuation such as the underscore), or

words = text.split(/[^\p{Latin}]/)

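which splits on anything that is not a character in the Unicode Latin script.
Note that both of these keep accented and special letters from other languages too, not just German. The \p{Latin} version also treats digits and underscores as separators, because they do not belong to the Latin script.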

See http://www.ruby-doc.org/core-1.9.3/Regexp.html and look for (1) "Character Classes" and (2) "Character Properties."
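
To make the difference concrete, here is a minimal sketch that runs the three patterns against the string from the question. The variable names are illustrative only, and the trailing + on each pattern is an addition that was not in the snippets above; it is assumed here so that runs of consecutive separators (for example ", ") do not produce empty strings. The magic comment is only needed on Ruby 1.9, where source files do not default to UTF-8.

```ruby
# encoding: utf-8
text = "Some super text with a german umlaut Wirtschaftsprüfer"

# \W is ASCII-only, so "ü" is treated as a separator.
ascii_words = text.split(/\W+/)

# POSIX bracket class: Unicode-aware "word" characters are kept.
posix_words = text.split(/[^[:word:]]+/)

# Unicode script property: only Latin-script letters are kept.
latin_words = text.split(/[^\p{Latin}]+/)

puts ascii_words.last  # prints "fer"
puts posix_words.last  # prints "Wirtschaftsprüfer"
puts latin_words.last  # prints "Wirtschaftsprüfer"
```

For this input the two Unicode-aware patterns give the same words; they would differ on text containing digits or underscores.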

Reinstate Monica -- notmaynard answered Jun 07 '26 23:06


You could replace /\W+/ with /\s+/ (\s matches whitespace characters: spaces, tabs, and newlines). Note that any punctuation then stays attached to the neighbouring word.
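
A short sketch of the whitespace approach, using a made-up sample sentence (not from the question) that includes punctuation so the trade-off is visible. Calling split with no argument is shown alongside because it also splits on whitespace.

```ruby
# encoding: utf-8
# Hypothetical sample text, chosen to include punctuation.
text = "Ein Wirtschaftsprüfer, bitte!"

# Split on runs of whitespace; umlauts are untouched.
space_words = text.split(/\s+/)

# With no argument, String#split also splits on whitespace.
default_words = text.split

p space_words    # ["Ein", "Wirtschaftsprüfer,", "bitte!"]
p default_words  # ["Ein", "Wirtschaftsprüfer,", "bitte!"]
```

This is enough when the input is plain space-separated words; if commas and exclamation marks must be stripped as well, a Unicode-aware character class is the better fit.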

Baldrick answered Jun 08 '26 00:06