I have this line to remove all non-alphanumeric characters except spaces:
re.sub(r'\W+', '', s)
However, it still keeps non-English characters.
For example, if I have:
re.sub(r'\W+', '', 'This is a sentence, and here are non-english 托利 苏 !!11')
I want to get as output:
> 'This is a sentence and here are non-english 11'
Using the isalnum() function: another option is to keep only the characters for which isalnum() is true. Called on a string, it returns True if all characters in the string are alphanumeric, and False otherwise.
Use the re.sub() method to remove all non-alphabetic characters from a string, e.g. new_str = re.sub(r'[^a-zA-Z]', '', my_str). The re.sub() call removes every character that is not an English letter (including digits and spaces) by replacing it with an empty string.
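Python String isalnum() method: the isalnum() method returns True if all the characters are alphanumeric, meaning letters and numbers. In Python 3 this check is Unicode-aware, so it is not limited to a-z and 0-9: letters such as 托 or ä also count as alphanumeric. Examples of characters that are not alphanumeric: (space), !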
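As a minimal sketch of the isalnum() filtering idea applied to the question, the snippet below keeps spaces and ASCII letters and digits only. The function name keep_ascii_alnum_and_spaces is made up for illustration, and the extra isascii() check (available from Python 3.7) is an assumption added here because isalnum() alone would keep the non-English characters.

```python
def keep_ascii_alnum_and_spaces(s: str) -> str:
    """Keep spaces plus ASCII letters and digits; drop everything else.

    Hypothetical helper for illustration. str.isalnum() on its own is
    Unicode-aware, so it is combined with str.isascii() (Python 3.7+).
    """
    return "".join(
        ch for ch in s
        if ch == " " or (ch.isascii() and ch.isalnum())
    )


if __name__ == "__main__":
    # isalnum() alone treats CJK characters as alphanumeric.
    print("托".isalnum())
    print(keep_ascii_alnum_and_spaces("abc 托利 12!"))
```

Spaces that surrounded the removed characters stay in place, so the result can contain runs of consecutive spaces.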
In other languages such as Java, the equivalent approach is to use the String.replaceAll method to replace all the non-alphanumeric characters with an empty string.
re.sub(r'[^A-Za-z0-9 ]+', '', s)
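(Edit) To clarify:
The [] create a character class (a set of characters). The ^ negates the set. A-Za-z are the letters of the English alphabet, 0-9 are the digits, and the trailing space inside the brackets is a literal space. Any run of one or more characters outside that set (that is, anything that is not A-Z, a-z, 0-9, or a space) is replaced with the empty string.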
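To see what that pattern does to the sentence from the question, here is a small sketch. The function names and the second, "tidy" variant are assumptions added for illustration: the pattern as posted also removes the hyphen in "non-english" and leaves several spaces in a row, so the variant keeps hyphens and collapses repeated spaces to reproduce the output the question asks for.

```python
import re

SAMPLE = 'This is a sentence, and here are non-english 托利 苏 !!11'


def strip_non_ascii_alnum(s: str) -> str:
    """Apply the pattern from the answer: keep A-Z, a-z, 0-9 and spaces."""
    return re.sub(r'[^A-Za-z0-9 ]+', '', s)


def strip_and_tidy(s: str) -> str:
    """Hypothetical variant: also keep '-' and collapse runs of spaces."""
    # A hyphen placed last inside the brackets is a literal hyphen.
    kept = re.sub(r'[^A-Za-z0-9 -]+', '', s)
    return re.sub(r' {2,}', ' ', kept)


if __name__ == "__main__":
    print(repr(strip_non_ascii_alnum(SAMPLE)))
    print(repr(strip_and_tidy(SAMPLE)))
```

With the sample sentence, the first function yields "nonenglish" followed by three spaces before "11", while the second yields 'This is a sentence and here are non-english 11'.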
This might not be an answer to this specific question, but I came across this thread during my research.
I wanted to achieve the same thing as the questioner, but I wanted to keep non-English characters such as ä, ü, ß, ...
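The way the questioner's code works, spaces will be deleted too.
A simple workaround is the following:
re.sub(r'[^ \w]+', '', string)
The ^ means that everything except what follows inside the brackets is matched. Here that is a space and \w, so every word character (letters, including non-English ones, digits, and the underscore) and every space is kept. The + belongs outside the brackets: inside them it would be a literal plus sign, and plus signs would be kept as well.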
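A short sketch of this Unicode-aware approach follows. The function name and the sample string are made up for illustration; it assumes Python 3, where \w in a str pattern matches Unicode word characters by default. The quantifier is written outside the brackets, since a + inside a character class is treated as a literal plus sign.

```python
import re


def keep_word_chars_and_spaces(s: str) -> str:
    """Remove everything except spaces and word characters.

    In Python 3, \\w in a str pattern matches Unicode letters, digits
    and the underscore, so letters such as ä, ü and ß are kept.
    """
    return re.sub(r'[^ \w]+', '', s)


if __name__ == "__main__":
    text = 'Größe: 10+5 äüß!'
    # Quantifier outside the class: '+' in the text is removed.
    print(keep_word_chars_and_spaces(text))
    # '+' inside the class is literal, so plus signs survive.
    print(re.sub(r'[^ \w+]', '', text))
```

Note that \w also matches the underscore, so underscores are kept by this pattern.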
I hope this will help someone in the future