I have a file and I want to remove all non-word characters from it, with the exception of ä, ö and ü, which are mutated vowels in the German language. Is there a way to do word.gsub!(/\W/, '') and put exceptions in it?
Example:
text = "übung bzw. äffchen"
text.gsub!(/\W/, '')
This returns "bungbzwffchen". It deletes the non-word characters, but it also removes the umlauts ü and ä, which I want to keep.
You could define a list of exclusions with a negative look-behind, but I think the simplest approach is to use \w instead of \W and negate the whole character class:
word.gsub!(/[^\wÄäÖöÜü]/, '')
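As a quick check, here is the negated character class applied to the sample string from the question. This is only a sketch: the variable name cleaned is made up for illustration, and it uses the non-destructive gsub because gsub! returns nil when nothing was replaced.

```ruby
# Sample string from the question.
text = "übung bzw. äffchen"

# Remove everything that is neither an ASCII word character (\w)
# nor one of the explicitly listed umlauts.
cleaned = text.gsub(/[^\wÄäÖöÜü]/, '')

puts cleaned  # prints "übungbzwäffchen"
```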
You could also use word.gsub(/[^\p{Letter}]/, ''), which removes every character that is not classified as a "Letter" in Unicode. Note that, unlike \w, this also removes digits and underscores.
You mention German vowels in your question; it's worth noting that the German alphabet also includes the sharp s (Eszett): ẞ / ß
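To see how the two approaches differ, the sketch below runs both on a made-up sample string (the string and the variable names are assumptions for illustration, not part of the question). The \p{Letter} version keeps every Unicode letter, including ß, but drops digits and underscores.

```ruby
# Hypothetical sample containing letters, an underscore, digits and a space.
sample = "Größe_42 übung"

# Keep only characters in the Unicode "Letter" category.
letters_only = sample.gsub(/[^\p{Letter}]/, '')

# Keep ASCII word characters plus an explicit list of German letters.
word_chars = sample.gsub(/[^\wÄäÖöÜüẞß]/, '')

puts letters_only  # prints "Größeübung"
puts word_chars    # prints "Größe_42übung"
```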
Update:
To answer your original question: to define a list of exclusions, you can use a negative look-behind, (?<!pat):
word.gsub(/\W(?<![ÄäÖöÜüẞß])/, '')
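A small sketch of the look-behind variant, using the umlauts from the question plus ẞ / ß. The sample string is extended with a made-up word ("Straße") to show that ß survives as well. The look-behind runs after \W has consumed a character, so it inspects that same character and rejects the match if it is one of the listed exceptions.

```ruby
# Sample from the question, extended with a hypothetical word containing ß.
text = "übung bzw. äffchen, Straße"

# \W matches any character that is not an ASCII word character (this
# includes ä, ö, ü and ß); the negative look-behind then vetoes the match
# when the character just consumed is in the exception list.
cleaned = text.gsub(/\W(?<![ÄäÖöÜüẞß])/, '')

puts cleaned  # prints "übungbzwäffchenStraße"
```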