
Keep only alphabetic characters (multilingual) in a string

On Stack Overflow there are many answers about how to keep only the alphabetic characters of a string; the most commonly accepted one is the well-known regex '[^a-zA-Z]'. But that answer is wrong, because it assumes everybody writes only in English. I considered downvoting all of those answers, but decided it would be more constructive to ask the question again, since I can't find a proper answer.

Is there an easy (or not so easy) way in Python to keep only the alphabetic characters of a string that works for all languages? Perhaps a library that does what XRegExp does in JavaScript. By "all languages" I mean English, but also French, Russian, Chinese, Greek, etc.

Laurent asked Jun 27 '17 11:06


1 Answer

[^\W\d_]

With Python 3, or with the re.UNICODE flag in Python 2, you can use [^\W\d_].

\W : If UNICODE is set, this will match anything other than [0-9_] plus characters classified as not alphanumeric in the Unicode character properties database.

So [^\W\d_] matches any character that is not a non-word character, not a digit and not an underscore. In other words, it's a word character that is neither a digit nor an underscore, i.e. an alphabetic character. :)

>>> import re
>>> re.findall(r"[^\W\d_]", "jüste Ä tösté 1234 ßÜ א д", re.UNICODE)
['j', 'ü', 's', 't', 'e', 'Ä', 't', 'ö', 's', 't', 'é', 'ß', 'Ü', 'א', 'д']
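If the goal is a cleaned string rather than a list of characters, the matches can simply be joined back together. A minimal sketch, assuming Python 3 (where str patterns are Unicode-aware by default); the name keep_letters is only illustrative:

```python
import re

# Runs of characters that are word characters but neither digits nor underscores.
# Compiled once so the function can be called repeatedly.
LETTER_RUNS = re.compile(r"[^\W\d_]+")

def keep_letters(text):
    # Collect every run of letters and glue the runs together,
    # which drops spaces, digits, underscores and punctuation.
    return "".join(LETTER_RUNS.findall(text))

print(keep_letters("jüste Ä tösté 1234 ßÜ א д"))
```

To keep the spaces between words as well, joining with " " instead of "" should be enough.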

Remove digits first, then look for "\w"

To avoid this convoluted logic, you could also remove digits and underscores first, and then look for alphanumeric characters:

>>> without_digit = re.sub(r"[\d_]", "", "jüste Ä tösté 1234 ßÜ א д", flags=re.UNICODE)
>>> re.findall(r"\w", without_digit, re.UNICODE)
['j', 'ü', 's', 't', 'e', 'Ä', 't', 'ö', 's', 't', 'é', 'ß', 'Ü', 'א', 'д']
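One detail about re.sub is worth spelling out: its fourth positional parameter is count, not flags, so a flag passed positionally is silently read as a replacement limit. The sketch below illustrates this with a string of 40 digits; the variable names are made up for the example.

```python
import re

digits = "1" * 40

# Signature: re.sub(pattern, repl, string, count=0, flags=0).
# re.UNICODE has the integer value 32, so passing it in fourth position
# means "replace at most 32 matches". The keyword form below spells that out.
limited = re.sub(r"\d", "", digits, count=int(re.UNICODE))

# Passing the flag by keyword removes every digit.
complete = re.sub(r"\d", "", digits, flags=re.UNICODE)

print(len(limited), len(complete))  # 8 0
```

On short inputs the difference goes unnoticed, which is why it is easy to miss.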

regex module

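The third-party regex module could also help, since it understands \p{L} (any Unicode letter) as well as the set difference [\w--\d_]. Note that, as far as I know, set operations such as -- are only enabled with its V1 flag, e.g. (?V1)[\w--\d_].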

This regex implementation is backwards-compatible with the standard ‘re’ module, but offers additional functionality.

>>> import regex as re
>>> re.findall(r"\p{L}", "jüste Ä tösté 1234 ßÜ א д", re.UNICODE)
['j', 'ü', 's', 't', 'e', 'Ä', 't', 'ö', 's', 't', 'é', 'ß', 'Ü', 'א', 'д']

(Tested with Anaconda Python 3.6)
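For comparison, here is a sketch that needs neither re nor the third-party regex module. It is not one of the approaches above: it keeps every character whose Unicode general category starts with "L", which is the set that \p{L} describes. The helper name keep_letters_by_category is hypothetical. As far as I can tell, this is also slightly stricter than [^\W\d_], which lets through a few numeric characters such as '²' that Python treats as alphanumeric but not as decimal digits.

```python
import unicodedata

def keep_letters_by_category(text):
    # Letter categories are Lu, Ll, Lt, Lm and Lo; they all start with "L".
    return "".join(
        ch for ch in text
        if unicodedata.category(ch).startswith("L")
    )

# Digits, underscore, space and the superscript two are all dropped.
print(keep_letters_by_category("a²b_1 é"))  # abé
```

str.isalpha() is documented in terms of the same letter categories, so ch.isalpha() should work as the filter too.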

Eric Duminil answered Nov 11 '22 07:11