
Python, remove all non-alphabet chars from string

Tags: python, regex

I am writing a Python MapReduce word count program. The problem is that there are many non-alphabet characters strewn about in the data. I found the post "Stripping everything but alphanumeric chars from a string in Python", which shows a nice solution using regex, but I am not sure how to implement it.

    def mapfn(k, v):
        print v
        import re, string
        pattern = re.compile('[\W_]+')
        v = pattern.match(v)
        print v
        for w in v.split():
            yield w, 1

I'm afraid I am not sure how to use the re library, or even regex for that matter. I am not sure how to properly apply the regex pattern to the incoming string v (a line of a book) to get back the line without any non-alphanumeric characters.

Suggestions?

KDecker asked Mar 20 '14 00:03




2 Answers

Use re.sub

    import re

    regex = re.compile('[^a-zA-Z]')
    # First parameter is the replacement, second parameter is your input string
    regex.sub('', 'ab3d*E')
    # Out: 'abdE'

Alternatively, if you only want to remove a certain set of characters (an apostrophe might be okay in your input, for example):

    regex = re.compile(r'[,.!?]')  # etc.
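As a minimal sketch of how re.sub might slot into the asker's mapfn (written in Python 3 syntax, which is an assumption, since the question uses Python 2 print statements): the name NON_LETTERS is made up for illustration, and replacing with a space instead of an empty string is a choice made here so that words separated only by punctuation are not glued together before split().

```python
import re

# Hypothetical module-level pattern: one or more characters that are
# not ASCII letters. Compiled once so every call to mapfn reuses it.
NON_LETTERS = re.compile(r'[^a-zA-Z]+')

def mapfn(k, v):
    # Replace each run of non-letters with a single space, then split
    # on whitespace to get the individual words.
    cleaned = NON_LETTERS.sub(' ', v)
    for w in cleaned.split():
        yield w, 1

pairs = list(mapfn(0, "Hello, world! It's 2014."))
print(pairs)
```

Note that with this pattern the apostrophe in "It's" also acts as a separator, so the line yields "It" and "s" as separate words; the narrower character set above avoids that.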
limasxgoesto0 answered Oct 18 '22 08:10


If you prefer not to use regex, you might try

''.join([i for i in s if i.isalpha()]) 
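For the word count case, a variation on this one-liner may be more convenient. The sketch below (Python 3; the helper name strip_non_letters is hypothetical) swaps each non-letter for a space instead of dropping it, so the result can still be split into words. One difference from the [^a-zA-Z] regex is that str.isalpha() is Unicode-aware in Python 3, so accented letters are kept.

```python
def strip_non_letters(s):
    # Keep alphabetic characters; turn everything else into a space
    # so that word boundaries survive for a later split().
    return ''.join(c if c.isalpha() else ' ' for c in s)

# The one-liner from the answer, applied to the earlier sample input.
letters_only = ''.join(i for i in 'ab3d*E' if i.isalpha())

words = strip_non_letters("café, naïve & 3 cats").split()
print(letters_only, words)
```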
Tad answered Oct 18 '22 08:10