From a list of strings, I want to extract all words and add them to a new list. I managed to do so using pattern matching in the form of:
import re
p = re.compile('[a-z]+', re.IGNORECASE)
p.findall("02_Sektion_München_Gruppe_Süd")
Unfortunately, the strings contain language-specific characters (such as ü), so a string like the one in the example yields:
['Sektion', 'M', 'nchen', 'Gruppe', 'S', 'd']
I want it to yield:
['Sektion', 'München', 'Gruppe', 'Süd']
I would be grateful for suggestions on how to solve this problem.
You may use
import re
p = re.compile(r'[^\W\d_]+')
print(p.findall("02_Sektion_München_Gruppe_Süd"))
# => ['Sektion', 'München', 'Gruppe', 'Süd']
See the Python 3 demo.
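The [^\W\d_]+ pattern matches one or more characters that are neither non-word characters (\W), digits (\d), nor underscores (_). Since \w covers letters, digits and the underscore, excluding the latter two leaves essentially only letters, including non-ASCII ones such as ü.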
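To make the difference between the three character classes concrete, here is a small sketch that runs them against the string from the question. The variable names are only illustrative, and it assumes Python 3, where str patterns are Unicode-aware by default:

```python
import re

sample = "02_Sektion_München_Gruppe_Süd"

# ASCII-only class from the question: words are split at the umlauts.
ascii_words = re.findall(r'[a-z]+', sample, re.IGNORECASE)

# \w also accepts digits and underscores, so the whole string is one match.
word_chars = re.findall(r'\w+', sample)

# Negated class: word characters minus digits and underscores.
letters = re.findall(r'[^\W\d_]+', sample)

print(ascii_words)  # ['Sektion', 'M', 'nchen', 'Gruppe', 'S', 'd']
print(word_chars)   # ['02_Sektion_München_Gruppe_Süd']
print(letters)      # ['Sektion', 'München', 'Gruppe', 'Süd']
```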
In Python 2.x, you will have to add the re.UNICODE (re.U) flag and pass the input as a unicode string (for example u'Süd') to make it match Unicode letters:
p = re.compile(r'[^\W\d_]+', re.U)
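Coming back to the original goal of collecting the words from a whole list of strings into a new list, one possible sketch (Python 3) could look like this. The helper name collect_words and the second input string are made up for illustration and do not come from the question:

```python
import re

# Letters only: word characters minus digits and underscores.
LETTERS = re.compile(r'[^\W\d_]+')

def collect_words(strings):
    """Return one flat list with all letter-only words from every string."""
    words = []
    for s in strings:
        # findall returns a list of matches; extend appends them one by one.
        words.extend(LETTERS.findall(s))
    return words

# The second entry is a hypothetical example in the same naming scheme.
names = ["02_Sektion_München_Gruppe_Süd", "03_Sektion_Köln_Gruppe_Nord"]
print(collect_words(names))
# ['Sektion', 'München', 'Gruppe', 'Süd', 'Sektion', 'Köln', 'Gruppe', 'Nord']
```

Compiling the pattern once outside the loop keeps the intent readable; whether it matters for speed depends on how many strings you process.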