I am trying to clean a string so that it contains no punctuation or digits; it must only contain a-z and A-Z. For example, the given string is:
"coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"
Required output is:
['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']
My solution is
re.findall(r"([A-Za-z]+)" ,string)
My output is
['coMPuter', 'scien', 'tist', 's', 'are', 'the', 'rock', 'stars', 'of', 'tomorrow', 'cool']
You don't need a regular expression.
Convert the string to lower case (if you want all lower-cased words), split it into words, keep only the words that start with a letter, and strip the non-letter characters from each:
>>> s = "coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"
>>> [filter(str.isalpha, word) for word in s.lower().split() if word[0].isalpha()]
['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']
In Python 3.x, filter(str.isalpha, word) should be replaced with ''.join(filter(str.isalpha, word)), because in Python 3.x filter returns a filter object rather than a string.
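For Python 3, the same idea can be written out as a small function. This is only a sketch: the name clean_words is made up for illustration, and it assumes words are separated by whitespace. Note that str.isalpha also accepts non-ASCII letters, so a stricter a-z check would be needed if the input can contain them.

```python
def clean_words(text):
    """Return lower-cased words with every non-letter character removed."""
    words = []
    for token in text.lower().split():
        # Skip tokens that do not start with a letter, e.g. "<cool>" or "????".
        if not token[0].isalpha():
            continue
        # filter() returns a lazy iterator in Python 3, so join it back into a str.
        words.append("".join(filter(str.isalpha, token)))
    return words


s = "coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"
print(clean_words(s))
# ['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']
```

split() without arguments never produces empty tokens, so indexing token[0] is safe here.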