
Python remove anything that is not a letter or number

I'm having a little trouble with Python regular expressions.

What is a good way to remove all characters in a string that are not letters or numbers?

Thanks!

Chris Dutrow asked Jun 12 '11 17:06


People also ask

How do you get rid of non letters in Python?

Use the isalnum() method. It reports whether a character (or a whole string) is alphanumeric, so you can test each character of the string individually and combine the ones that pass using join().
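A minimal sketch of the isalnum() plus join() approach described above. The function name `keep_alnum` is only illustrative. Note that on a Python 3 str, isalnum() also accepts non-ASCII letters and digits.

```python
def keep_alnum(text: str) -> str:
    """Return text with every non-alphanumeric character dropped."""
    # Test each character on its own and join the ones that pass.
    return "".join(ch for ch in text if ch.isalnum())


print(keep_alnum("Hello, World! 123_abc"))  # HelloWorld123abc
```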

How do I get rid of non alphabetic characters?

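A common way to remove all non-alphanumeric characters from a string is a regular expression. The character class [^A-Za-z0-9] matches every character that is not an ASCII letter or digit, so replacing its matches with the empty string leaves only alphanumeric characters. [^\w] (the same as \W) is close but not identical: \w also matches the underscore, so [^\w] is equivalent to [^a-zA-Z_0-9] under ASCII matching and leaves underscores in place. On a Python 3 str, \w also matches non-ASCII letters and digits unless re.ASCII is set.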
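To make the difference between the two character classes concrete, here is a small sketch (the sample string is made up for illustration):

```python
import re

sample = "snake_case #42!"

# Explicit class: everything except ASCII letters and digits is removed,
# including the underscore.
explicit = re.sub(r"[^A-Za-z0-9]", "", sample)

# \W (same as [^\w]): the underscore counts as a word character and survives.
word_class = re.sub(r"\W", "", sample)

print(explicit)    # snakecase42
print(word_class)  # snake_case42
```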

How do you find non-alphanumeric characters in Python?

Use the string isalnum() method. It returns True if the string is non-empty and all of its characters are alphanumeric, meaning letters (a-z) and digits (0-9); on a Python 3 str this also includes non-ASCII letters and digits. Examples of characters that are not alphanumeric: (space) and !
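A short sketch of how isalnum() behaves on whole strings, and how it can be used to list the non-alphanumeric characters in a string. The sample values are illustrative only.

```python
samples = ["abc123", "abc 123", "abc!", ""]

# isalnum() is False as soon as one character fails, and for the empty string.
results = {text: text.isalnum() for text in samples}
print(results)
# {'abc123': True, 'abc 123': False, 'abc!': False, '': False}

# To find the offending characters rather than just test for them:
offenders = [ch for ch in "a b!c" if not ch.isalnum()]
print(offenders)  # [' ', '!']
```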


2 Answers

[\w] matches (alphanumeric or underscore).

[\W] matches (not (alphanumeric or underscore)), which is equivalent to (not alphanumeric and not underscore)

You need [\W_] to remove ALL non-alphanumerics.

When using re.sub(), it is more efficient to match whole runs with [\W_]+ than to match one character at a time, because each substitution is expensive and matching runs reduces their number.
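One way to see the reduction in substitutions is re.subn(), which returns the new string together with the number of replacements made. This is only a sketch with a made-up input; it counts substitutions and does not measure timing.

```python
import re

s = "a -- b __ c"

# One match (and one substitution) per unwanted character.
single_result, single_count = re.subn(r"[\W_]", "", s)

# One match per run of unwanted characters.
run_result, run_count = re.subn(r"[\W_]+", "", s)

print(single_result, single_count)  # abc 8
print(run_result, run_count)        # abc 2
```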

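Now all you need is to define alphanumerics. (The examples below are written for Python 2, where str is a byte string and unicode is a separate type.)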

str object, only ASCII A-Za-z0-9:

    re.sub(r'[\W_]+', '', s) 

str object, only locale-defined alphanumerics:

    re.sub(r'[\W_]+', '', s, flags=re.LOCALE) 

unicode object, all alphanumerics:

    re.sub(ur'[\W_]+', u'', s, flags=re.UNICODE) 

Examples for str object:

    >>> import re, locale
    >>> sall = ''.join(chr(i) for i in xrange(256))
    >>> len(sall)
    256
    >>> re.sub('[\W_]+', '', sall)
    '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
    >>> re.sub('[\W_]+', '', sall, flags=re.LOCALE)
    '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
    >>> locale.setlocale(locale.LC_ALL, '')
    'English_Australia.1252'
    >>> re.sub('[\W_]+', '', sall, flags=re.LOCALE)
    '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\x83\x8a\x8c\x8e\
    x9a\x9c\x9e\x9f\xaa\xb2\xb3\xb5\xb9\xba\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\
    xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\
    xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\
    xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
    # above output wrapped at column 80

Unicode example:

    >>> re.sub(ur'[\W_]+', u'', u'a_b A_Z \x80\xFF \u0404', flags=re.UNICODE)
    u'abAZ\xff\u0404'
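The snippets in this answer use Python 2 syntax (`ur''` literals, `xrange`, a separate `unicode` type). As a rough sketch of how the same idea could look on Python 3, where str is Unicode and \w is Unicode-aware by default, the three cases might be written as below. This is an adaptation, not part of the original answer.

```python
import re

s = "a_b A_Z \x80\xff \u0404"

# str pattern, default behaviour: all Unicode alphanumerics are kept.
unicode_result = re.sub(r"[\W_]+", "", s)

# str pattern with re.ASCII: only A-Za-z0-9 are kept.
ascii_result = re.sub(r"[\W_]+", "", s, flags=re.ASCII)

# bytes pattern: \w is ASCII-only; re.LOCALE is accepted only with bytes patterns.
bytes_result = re.sub(rb"[\W_]+", b"", b"a_b A_Z \x80\xff")

print(ascii(unicode_result))  # 'abAZ\xff\u0404'
print(ascii_result)           # abAZ
print(bytes_result)           # b'abAZ'
```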
John Machin answered Sep 30 '22 07:09


In the char set matching rule [...] you can specify ^ as first char to mean "not in"

    import re
    re.sub("[^0-9a-zA-Z]",        # Anything except 0..9, a..z and A..Z
           "",                    # replaced with nothing
           "this is a test!!")    # in this string
    # --> 'thisisatest'
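If the same cleanup is applied to many strings, one option is to compile the negated character class once and wrap it in a helper. The names `_NON_ALNUM` and `strip_non_alnum` below are hypothetical, and the trailing `+` is an addition so that each run of unwanted characters is replaced in one substitution.

```python
import re

# Anything except 0..9, a..z and A..Z, matched in runs.
_NON_ALNUM = re.compile(r"[^0-9a-zA-Z]+")


def strip_non_alnum(text: str) -> str:
    """Remove every character that is not an ASCII letter or digit."""
    return _NON_ALNUM.sub("", text)


print(strip_non_alnum("this is a test!!"))  # thisisatest
```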
6502 answered Sep 30 '22 05:09