
Strip non-alphanumeric characters from a string in Python but keep special characters

I know similar questions have been asked here on StackOverflow. I tried to adapt some of the approaches, but I couldn't get anything to work that fits my needs:

Given a Python string, I want to strip every non-alphanumeric character but keep special characters like µ æ Å Ç ß... Is this even possible? With regexes I tried variations of this:

re.sub(r'[^a-zA-Z0-9: ]', '', x) # x is my string to sanitize

but it strips more than I want. An example of what I want:

Input:  "A string, with characters µ, æ, Å, Ç, ß,... Some    whitespace  confusion  ?"
Output: "A string with characters µ æ Å Ç ß Some whitespace confusion"

Is this even possible without getting complicated?

Aufwind asked Feb 24 '23 15:02



1 Answer

Use \w with the UNICODE flag set. Note that \w also matches the underscore, so you may need to handle that separately.

Details on http://docs.python.org/library/re.html.

EDIT: Here is some actual code. It keeps Unicode letters, Unicode digits, and whitespace.

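import re

x = u'$a_bßπ7: ^^@p'
# Remove everything that is neither a word character nor whitespace (Unicode-aware).
pattern = re.compile(r'[^\w\s]', re.U)
# \w also matches the underscore, so remove it in a second pass.
print(re.sub(r'_', '', pattern.sub('', x)))  # abßπ7 p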

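Without re.U, the ß and π characters would have been stripped in Python 2. In Python 3, str patterns are Unicode-aware by default, so re.U is redundant there (re.ASCII restores the ASCII-only behavior).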
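
To see what the flag changes, here is a small sketch for Python 3. It assumes Python 3, where str patterns are already Unicode-aware, so re.ASCII is used to stand in for the "no re.U" case. The variable names are just for illustration.

```python
import re

x = '$a_bßπ7: ^^@p'

# Default in Python 3: \w matches Unicode letters and digits (and the underscore).
unicode_aware = re.sub(r'[^\w\s]', '', x)

# re.ASCII limits \w to [a-zA-Z0-9_], so ß and π are treated as non-word characters.
ascii_only = re.sub(r'[^\w\s]', '', x, flags=re.ASCII)

print(unicode_aware)  # a_bßπ7 p
print(ascii_only)     # a_b7 p
```

The underscore survives in both cases because it is part of \w either way.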

Sorry I can't figure out a way to do this with one regex. If you can, can you post a solution?
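
One possible way to fold both steps into a single pattern is an alternation that matches either a non-word, non-whitespace character or the underscore. The sketch below assumes Python 3; sanitize is a hypothetical helper name, and the whitespace collapsing is an extra step added only to reproduce the output asked for in the question.

```python
import re

# Matches anything that is not a word character or whitespace, or an underscore.
# In Python 3, str patterns are Unicode-aware by default, so no flag is needed.
STRIP_PATTERN = re.compile(r'[^\w\s]|_')
WHITESPACE_RUN = re.compile(r'\s+')


def sanitize(text):
    """Hypothetical helper: drop punctuation/symbols, then normalize whitespace."""
    cleaned = STRIP_PATTERN.sub('', text)
    # Collapse runs of whitespace into one space and trim the ends.
    return WHITESPACE_RUN.sub(' ', cleaned).strip()


question_input = "A string, with characters µ, æ, Å, Ç, ß,... Some    whitespace  confusion  ?"
print(sanitize(question_input))
# A string with characters µ æ Å Ç ß Some whitespace confusion
```

If the original spacing should be kept, the second substitution can simply be left out.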

Ray Toal answered Feb 26 '23 16:02
