I know similar questions have been asked here on StackOverflow. I tried to adapt some of the approaches, but I couldn't get anything to work that fits my needs:
Given a Python string, I want to strip every non-alphanumeric character, but keep special characters like µ æ Å Ç ß... Is this even possible? With regexes, I tried variations of this:
re.sub(r'[^a-zA-Z0-9: ]', '', x) # x is my string to sanitize
but it strips more than I want. An example of what I want would be:
Input: "A string, with characters µ, æ, Å, Ç, ß,... Some whitespace confusion ?"
Output: "A string with characters µ æ Å Ç ß Some whitespace confusion"
Is this even possible without getting complicated?
Use \w with the UNICODE flag set. This will match the underscore also, so you might need to take care of that separately.
Details on http://docs.python.org/library/re.html.
EDIT: Here is some actual code. It will keep unicode letters, unicode digits, and spaces.
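import re
x = u'$a_bßπ7: ^^@p'
# Drop everything that is neither a word character nor whitespace; re.U makes \w Unicode-aware.
pattern = re.compile(r'[^\w\s]', re.U)
# \w also matches the underscore, so remove it in a second pass.
result = re.sub(r'_', '', pattern.sub('', x))
print(result)  # abßπ7 p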
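Without re.U, the ß and π characters would be stripped as well. (That applies to Python 2; on Python 3, str patterns are Unicode-aware by default, so the flag is redundant there.)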
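To see the effect of the flag on Python 3, here is a small sketch that assumes Python 3 and uses re.ASCII to emulate the non-Unicode behaviour. The variable names unicode_result and ascii_result are made up for illustration.

```python
import re

x = '$a_bßπ7: ^^@p'

# Default for str patterns on Python 3: \w is Unicode-aware, so ß and π survive.
unicode_result = re.sub(r'[^\w\s]', '', x)

# re.ASCII restricts \w to [a-zA-Z0-9_], so ß and π are stripped too.
ascii_result = re.sub(r'[^\w\s]', '', x, flags=re.ASCII)

print(unicode_result)  # a_bßπ7 p
print(ascii_result)    # a_b7 p
```

The underscore survives in both results because \w matches it, which is why it needs separate handling.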
Sorry I can't figure out a way to do this with one regex. If you can, can you post a solution?
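One possible way to fold both steps into a single regex is to add the underscore as an alternative to the negated class. This is only a sketch, not the original answerer's solution, and the names NON_ALNUM and strip_non_alnum are hypothetical.

```python
import re

# Matches any character that is neither a word character nor whitespace,
# or a literal underscore (which \w would otherwise let through).
NON_ALNUM = re.compile(r'[^\w\s]|_', re.UNICODE)

def strip_non_alnum(text):
    """Remove everything except Unicode letters, digits, and whitespace."""
    return NON_ALNUM.sub('', text)

print(strip_non_alnum('$a_bßπ7: ^^@p'))  # abßπ7 p
```

Applied to the example from the question, this leaves a trailing space where " ?" used to be, so a final .strip() may be needed to get exactly the desired output.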