
How to remove all non word characters except ä,ö and ü from a text using RegExp

Tags:

regex

ruby

I have a file and I want to remove all non-word characters from it, except for ä, ö and ü, the German umlauts (mutated vowels). Is there a way to call word.gsub!(/\W/, '') and add exceptions to it?

Example:

text = "übung bzw. äffchen"
text.gsub!(/\W/, '')

This returns "bungbzwffchen". It deletes the non-word characters, but it also removes the umlauts ü and ä, which I want to keep.

Ashman asked Oct 17 '25 11:10


1 Answer

You could define a list of exclusions with a negative look-behind, but I think the simplest approach is to use \w instead of \W inside a negated character class and list the extra characters you want to keep:

word.gsub!(/[^\wÄäÖöÜü]/, '') 
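As a minimal sketch, here is the negated class applied to the sample string from the question. The helper name strip_non_word and the constant KEEP_UMLAUTS are hypothetical, chosen only for illustration, and the sketch assumes a UTF-8 source file with precomposed umlauts.

```ruby
# Hypothetical helper: keep ASCII word characters plus the German umlauts.
# In a Ruby regexp, \w matches only [a-zA-Z0-9_], so the umlauts have to be
# listed explicitly inside the negated character class.
KEEP_UMLAUTS = /[^\wÄäÖöÜü]/

def strip_non_word(text)
  # gsub (without !) returns a new string and leaves the argument untouched.
  text.gsub(KEEP_UMLAUTS, '')
end

puts strip_non_word("übung bzw. äffchen")  # prints "übungbzwäffchen"
```

Like the original \W version, this also removes spaces and punctuation; only the listed umlauts are spared.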

You could also use word.gsub(/[^\p{Letter}]/, ''), which removes every character that is not classified as a "Letter" in Unicode. Note that this is stricter than \w: digits and underscores are removed as well.
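To illustrate how the Unicode property classes behave, the sketch below compares \p{Letter} with \p{Word}, Ruby's Unicode-aware counterpart of \w. The sample string is an assumption: it extends the question's example with an underscore and a digit.

```ruby
# Sample extended with "_2" to show which characters each class keeps.
sample = "übung_2 bzw. äffchen"

# \p{Letter} keeps letters only, so "_" and "2" are removed as well.
letters_only = sample.gsub(/[^\p{Letter}]/, '')

# \p{Word} covers letters, marks, numbers and connector punctuation,
# so it behaves like a Unicode-aware \w.
unicode_word = sample.gsub(/[^\p{Word}]/, '')

puts letters_only  # prints "übungbzwäffchen"
puts unicode_word  # prints "übung_2bzwäffchen"
```

Both variants keep every Unicode letter, not just the German ones, so they fit best when accented letters from other languages should survive too.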

You mention German umlauts in your question. I think it's worth noting here that the German alphabet also includes the sharp s (Eszett): ẞ / ß

Update:

To answer your original question: to define a list of exclusions, you can use a negative look-behind, (?<!pat):

word.gsub(/\W(?<![ÄäÖöÜüẞß])/, '')
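A short sketch of the look-behind form, with Ü and ü in the exception list so that the umlauts from the question survive. The constant name EXCEPT_GERMAN and the extra word "Straße" in the sample are assumptions added for illustration.

```ruby
# \W consumes one non-word character (anything outside [a-zA-Z0-9_]).
# The negative look-behind then inspects the character just consumed and
# rejects the match if it is one of the listed exceptions.
EXCEPT_GERMAN = /\W(?<![ÄäÖöÜüẞß])/

text = "übung bzw. äffchen, Straße"
cleaned = text.gsub(EXCEPT_GERMAN, '')

puts cleaned  # prints "übungbzwäffchenStraße"
```

The result matches the negated character class approach; the look-behind is mainly useful when you want to keep \W as the base pattern and bolt exceptions onto it.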
Kimmo Lehto answered Oct 19 '25 07:10