
Python - keep only alphanumeric and space, and ignore non-ASCII

Tags:

python

regex

I have this line to remove all non-alphanumeric characters except spaces

re.sub(r'\W+', '', s)

However, it still keeps non-English characters.

For example, if I have:

re.sub(r'\W+', '', 'This is a sentence, and here are non-english 托利 苏 !!11')

I want to get as output:

> 'This is a sentence and here are non-english  11'
Filipe asked Apr 29 '19 11:04



2 Answers

re.sub(r'[^A-Za-z0-9 ]+', '', s)

(Edit) To clarify: The [] define a character class, and a leading ^ negates it. A-Za-z covers the English letters, 0-9 covers the digits, and the literal space inside the brackets keeps spaces. So any run of one or more characters that is not A-Z, a-z, 0-9, or a space is replaced with the empty string.
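As a quick check of the pattern above, here is a minimal sketch that runs it on the sentence from the question. The function name keep_ascii_alnum_and_space and the optional whitespace collapsing are illustrative assumptions, not part of the original answer.

```python
import re

# Matches any run of characters that are not ASCII letters, digits, or a space.
_NON_ASCII_ALNUM_OR_SPACE = re.compile(r'[^A-Za-z0-9 ]+')


def keep_ascii_alnum_and_space(text, collapse_spaces=False):
    """Remove everything except A-Z, a-z, 0-9 and the space character."""
    cleaned = _NON_ASCII_ALNUM_OR_SPACE.sub('', text)
    if collapse_spaces:
        # Optional extra step (an assumption): squeeze runs of spaces into one.
        cleaned = re.sub(r' {2,}', ' ', cleaned).strip()
    return cleaned


sample = 'This is a sentence, and here are non-english 托利 苏 !!11'
print(keep_ascii_alnum_and_space(sample, collapse_spaces=True))
# This is a sentence and here are nonenglish 11
```

Note that the hyphen in "non-english" is removed as well, because it is not listed in the class. To keep hyphens, add one at the end of the brackets, for example r'[^A-Za-z0-9 -]+'.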

Nir Levy answered Dec 06 '22 03:12


This may not answer this exact question, but I came across this thread during my own research.

I wanted to achieve the same goal as the asker, but I also wanted to keep non-English letters such as ä, ü, ß, ...

As written, the asker's code deletes spaces too.

A simple workaround is the following:

re.sub(r'[^ \w]', '', string)

The ^ at the start of the brackets negates the class, so everything except the listed characters is matched and removed. The listed characters here are the space and \w, which in Python 3 matches every word character: letters (including non-English ones), digits, and the underscore. A + inside the brackets would be a literal plus sign rather than a quantifier, so it does not belong there.
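To illustrate the Unicode-aware variant, here is a small sketch, assuming Python 3, where str patterns match Unicode word characters by default. The function name keep_word_chars_and_space and the sample strings are made up for illustration.

```python
import re


def keep_word_chars_and_space(text):
    """Remove everything except word characters (\\w) and spaces."""
    return re.sub(r'[^\w ]+', '', text)


print(keep_word_chars_and_space('Grüße, Jürgen! 100% äöü'))
# Grüße Jürgen 100 äöü

# Inside brackets, '+' is a literal plus sign, not a quantifier,
# so a class like [^ \w+] leaves '+' characters in the text.
print(re.sub(r'[^ \w+]', '', '1+1=2'))  # 1+12
print(re.sub(r'[^ \w]', '', '1+1=2'))   # 112

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so accented letters are removed.
print(re.sub(r'[^\w ]+', '', 'Grüße 1', flags=re.ASCII))  # Gre 1
```

Because \w also matches the underscore, underscores survive this filter; use an explicit class if they should be removed as well.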

I hope this helps someone in the future.

Tilman Böckenförde answered Dec 06 '22 01:12