I'm having a little trouble with Python regular expressions.
What is a good way to remove all characters in a string that are not letters or numbers?
Thanks!
Use the isalnum() Method to Remove All Non-Alphanumeric Characters in Python String. We can use the isalnum() method to check whether a given character or string is alphanumeric or not. We can compare each character individually from a string, and if it is alphanumeric, then we combine it using the join() function.
A common solution to remove all non-alphanumeric characters from a String is with regular expressions. The idea is to use the regular expression [^A-Za-z0-9] to retain only alphanumeric characters in the string. You can also use [^\w] regular expression, which is equivalent to [^a-zA-Z_0-9] .
Python String isalnum() Method The isalnum() method returns True if all the characters are alphanumeric, meaning alphabet letter (a-z) and numbers (0-9). Example of characters that are not alphanumeric: (space)!
[\w]
matches (alphanumeric or underscore).
[\W]
matches (not (alphanumeric or underscore)), which is equivalent to (not alphanumeric and not underscore)
You need [\W_]
to remove ALL non-alphanumerics.
When using re.sub(), it will be much more efficient if you reduce the number of substitutions (expensive) by matching using [\W_]+
instead of doing it one at a time.
Now all you need is to define alphanumerics:
str
object, only ASCII A-Za-z0-9:
re.sub(r'[\W_]+', '', s)
str
object, only locale-defined alphanumerics:
re.sub(r'[\W_]+', '', s, flags=re.LOCALE)
unicode
object, all alphanumerics:
re.sub(ur'[\W_]+', u'', s, flags=re.UNICODE)
Examples for str
object:
>>> import re, locale >>> sall = ''.join(chr(i) for i in xrange(256)) >>> len(sall) 256 >>> re.sub('[\W_]+', '', sall) '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' >>> re.sub('[\W_]+', '', sall, flags=re.LOCALE) '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' >>> locale.setlocale(locale.LC_ALL, '') 'English_Australia.1252' >>> re.sub('[\W_]+', '', sall, flags=re.LOCALE) '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\x83\x8a\x8c\x8e\ x9a\x9c\x9e\x9f\xaa\xb2\xb3\xb5\xb9\xba\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\ xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\ xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\ xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff' # above output wrapped at column 80
Unicode example:
>>> re.sub(ur'[\W_]+', u'', u'a_b A_Z \x80\xFF \u0404', flags=re.UNICODE) u'abAZ\xff\u0404'
In the char set matching rule [...]
you can specify ^
as first char to mean "not in"
import re re.sub("[^0-9a-zA-Z]", # Anything except 0..9, a..z and A..Z "", # replaced with nothing "this is a test!!") # in this string --> 'thisisatest'
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With