I'm having a little trouble with Python regular expressions. What is a good way to remove all characters in a string that are not letters or numbers? Thanks!


Python remove anything that is not a letter or number

2 Answers

[\w] matches (alphanumeric or underscore).

[\W] matches (not (alphanumeric or underscore)), which is equivalent to (not alphanumeric and not underscore).

You need [\W_] to remove ALL non-alphanumerics.

When using re.sub(), each substitution is expensive, so it is much more efficient to match whole runs of unwanted characters with [\W_]+ than to replace them one character at a time.
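To make the difference concrete, here is a small sketch using `re.subn()`, which returns the number of substitutions alongside the result. The sample string is made up for illustration, and the snippet is written for Python 3; it only counts substitutions and does not measure timing.

```python
import re

text = "a_b -- c!"  # illustrative input, not from the original answer

# One substitution per unwanted character.
single, n_single = re.subn(r"[\W_]", "", text)

# One substitution per run of unwanted characters.
runs, n_runs = re.subn(r"[\W_]+", "", text)

print(single, n_single)  # abc 6
print(runs, n_runs)      # abc 3
```

Both patterns produce the same string; the `+` version simply gets there with fewer replacements.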

Now all you need is to define alphanumerics (the str and unicode types below are the Python 2 ones):

str object, only ASCII A-Za-z0-9:

    re.sub(r'[\W_]+', '', s)

str object, only locale-defined alphanumerics:

    re.sub(r'[\W_]+', '', s, flags=re.LOCALE)

unicode object, all alphanumerics:

    re.sub(ur'[\W_]+', u'', s, flags=re.UNICODE)

Examples for str object:

    >>> import re, locale
    >>> sall = ''.join(chr(i) for i in xrange(256))
    >>> len(sall)
    256
    >>> re.sub('[\W_]+', '', sall)
    '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
    >>> re.sub('[\W_]+', '', sall, flags=re.LOCALE)
    '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'
    >>> locale.setlocale(locale.LC_ALL, '')
    'English_Australia.1252'
    >>> re.sub('[\W_]+', '', sall, flags=re.LOCALE)
    '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\x83\x8a\x8c\x8e\
    x9a\x9c\x9e\x9f\xaa\xb2\xb3\xb5\xb9\xba\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\
    xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\
    xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\
    xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
    # above output wrapped at column 80

Unicode example:

    >>> re.sub(ur'[\W_]+', u'', u'a_b A_Z \x80\xFF \u0404', flags=re.UNICODE)
    u'abAZ\xff\u0404'

answered Sep 30 '22 07:09

John Machin
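For anyone on Python 3, where `ur''` literals and `xrange` no longer exist, here is a minimal sketch of how the same distinctions could be expressed. It assumes Python 3 semantics: `\w` is Unicode-aware by default for `str` patterns, `re.ASCII` restricts it to A-Za-z0-9 and underscore, and `bytes` patterns are ASCII-only unless `re.LOCALE` is given. The variable names are illustrative.

```python
import re

s = "a_b A_Z \x80\xff \u0404"

# str pattern, default flags: non-ASCII letters count as alphanumeric.
unicode_alnum = re.sub(r"[\W_]+", "", s)

# str pattern with re.ASCII: only A-Za-z0-9 survive.
ascii_alnum = re.sub(r"[\W_]+", "", s, flags=re.ASCII)

# bytes pattern: \w is ASCII-only without re.LOCALE.
bytes_alnum = re.sub(rb"[\W_]+", b"", b"a_b A_Z \x80\xff")

print(ascii(unicode_alnum))  # 'abAZ\xff\u0404'
print(ascii_alnum)           # abAZ
print(bytes_alnum)           # b'abAZ'
```

The first result matches the Unicode example above; the other two correspond to the ASCII-only case.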

In a character set [...] you can put ^ as the first character to mean "not in":

    import re
    re.sub("[^0-9a-zA-Z]",        # Anything except 0..9, a..z and A..Z
           "",                    # replaced with nothing
           "this is a test!!")    # in this string
    # --> 'thisisatest'

answered Sep 30 '22 05:09

6502
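The two approaches differ on non-ASCII input, which is worth checking before picking one. Below is a rough sketch for Python 3; the helper name `keep_ascii_alnum` and the sample strings are made up for illustration, and the `+` quantifier is borrowed from the other answer.

```python
import re

# Explicit negated set: anything outside 0-9, a-z, A-Z is removed.
NON_ASCII_ALNUM = re.compile(r"[^0-9a-zA-Z]+")


def keep_ascii_alnum(text):
    """Hypothetical helper: strip everything except ASCII letters and digits."""
    return NON_ASCII_ALNUM.sub("", text)


ascii_only = keep_ascii_alnum("café_123!")

# [\W_]+ on a Python 3 str keeps accented letters, since \w is Unicode-aware.
unicode_aware = re.sub(r"[\W_]+", "", "café_123!")

print(keep_ascii_alnum("this is a test!!"))  # thisisatest
print(ascii_only)                            # caf123
print(unicode_aware)                         # café123
```

If "letters" should include accented or non-Latin characters, the `[\W_]+` form is probably the better fit; if only plain ASCII is wanted, the explicit set says so directly.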


Tags:

python

string

regex

Chris Dutrow
