
Regular expression to skip some specific characters

I am trying to clean a string so that it contains no punctuation or digits, only the letters a-z and A-Z. For example, given the string:

"coMPuter scien_tist-s are,,,  the  rock__stars of tomorrow_ <cool>  ????"

The required output is:

['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']

My solution is

re.findall(r"([A-Za-z]+)" ,string)

My output is

['coMPuter', 'scien', 'tist', 's', 'are', 'the', 'rock', 'stars', 'of', 'tomorrow', 'cool']
Raja Hammad Farooq asked Jan 05 '23 10:01



1 Answer

You don't need a regular expression for this.

Convert the string to lower case (if you want all lower-cased words), split it into words, keep only the words that start with a letter, and strip the non-letter characters from each remaining word:

>>> s = "coMPuter scien_tist-s are,,,  the  rock__stars of tomorrow_ <cool>  ????"
>>> [filter(str.isalpha, word) for word in s.lower().split() if word[0].isalpha()]
['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']

The session above is Python 2, where filter applied to a string returns a string. In Python 3.x, replace filter(str.isalpha, word) with ''.join(filter(str.isalpha, word)), because filter returns an iterator (a filter object) there rather than a string.
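
For reference, here is a minimal Python 3 sketch of the same approach as a standalone script. The function name clean_words is only an illustrative choice, not something from the question:

```python
def clean_words(text):
    """Return lower-cased words containing letters only.

    Words whose first character is not a letter (such as "<cool>" or
    "????") are dropped entirely; other words keep only their letters.
    """
    words = []
    for token in text.lower().split():
        # split() never yields empty strings, so token[0] is safe here.
        if token[0].isalpha():
            # filter() returns an iterator in Python 3, so join it back
            # into a string.
            words.append("".join(filter(str.isalpha, token)))
    return words


s = "coMPuter scien_tist-s are,,,  the  rock__stars of tomorrow_ <cool>  ????"
print(clean_words(s))
```

Note that str.isalpha is Unicode-aware, so it also accepts letters outside a-z. If the input may contain such characters and only ASCII letters are wanted, an additional check (for example, membership in string.ascii_lowercase) would be needed.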

falsetru answered Jan 13 '23 20:01
