I need to write a Python script that removes every word containing non-alphabetic characters from a text file, in order to test Zipf's law. For example:
asdf@gmail.com said: I've taken 2 reports to the boss
should become
taken reports to the boss
How should I proceed?
To remove all non-alphanumeric characters from a string in Python, call re.sub(), passing it a regular expression that matches all non-alphanumeric characters as the pattern and an empty string as the replacement. (Python's str.replace() does not accept regular expressions.) re.sub() returns a new string with all matches replaced.
You can also use the regular expression [^\w], which is equivalent to [^a-zA-Z0-9_] in ASCII mode (in Python 3, \w also matches non-ASCII letters and digits in str patterns unless re.ASCII is set). It matches any character outside the ranges A-Z, a-z, 0-9 and the underscore. Alternatively, you can use the character class \W, which directly matches any non-word character, i.e., [^a-zA-Z0-9_].
Use the re.sub() method to remove all non-alphabetic characters from a string, e.g. new_str = re.sub(r'[^a-zA-Z]', '', my_str). The re.sub() method removes all non-alphabetic characters from the string by replacing them with empty strings.
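A minimal sketch of the re.sub() approach, with illustrative variable names. Note that it deletes individual characters rather than whole words, so on its own it does not produce the output asked for in the question:

```python
import re

my_str = "I've taken 2 reports to the boss"

# Delete every character that is not an ASCII letter (spaces are removed too).
letters_only = re.sub(r'[^a-zA-Z]', '', my_str)

# Keep spaces so word boundaries survive; "I've" collapses to "Ive"
# and the removed "2" leaves a double space behind.
letters_and_spaces = re.sub(r'[^a-zA-Z ]', '', my_str)

print(letters_only)
print(letters_and_spaces)
```

To drop whole words instead, the text has to be split into tokens first and each token tested, which is what the answers below do.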
Using regular expressions to match only letters (and underscores), you can do this:
import re
s = "asdf@gmail.com said: I've taken 2 reports to the boss"
# s = open('text.txt').read()
tokens = s.strip().split()
clean_tokens = [t for t in tokens if re.match(r'[^\W\d]*$', t)]
# ['taken', 'reports', 'to', 'the', 'boss']
clean_s = ' '.join(clean_tokens)
# 'taken reports to the boss'
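The regex above and str.isalpha() agree on the example sentence but not on every token. The sketch below compares three checks; the helper names regex_ok and ascii_ok are made up for illustration, and re.fullmatch with + is used as a variant of the pattern above that also rejects empty strings:

```python
import re

def regex_ok(token):
    # Letters and underscores only (word characters minus digits), at least one.
    return re.fullmatch(r'[^\W\d]+', token) is not None

def ascii_ok(token):
    # ASCII letters only, at least one.
    return re.fullmatch(r'[a-zA-Z]+', token) is not None

for token in ["boss", "snake_case", "café", "I've", "2"]:
    # Columns: token, regex check, str.isalpha(), ASCII-only check
    print(token, regex_ok(token), token.isalpha(), ascii_ok(token))
```

The underscore token passes only the regex check, and the accented word passes everything except the ASCII-only check, so the right choice depends on what should count as "alphabetical" in the corpus.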
Try this:
sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
words = [word for word in sentence.split() if word.isalpha()]
# ['taken', 'reports', 'to', 'the', 'boss']
result = ' '.join(words)
# taken reports to the boss
You can use split() and isalpha() to get a list of the words that contain only alphabetic characters and have at least one character.
>>> sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
>>> alpha_words = [word for word in sentence.split() if word.isalpha()]
>>> print(alpha_words)
['taken', 'reports', 'to', 'the', 'boss']
You can then use join() to make the list into one string:
>>> alpha_only_string = " ".join(alpha_words)
>>> print(alpha_only_string)
taken reports to the boss
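Since the stated goal is to test Zipf's law (word frequency roughly inversely proportional to frequency rank), here is one way the filtered words might be turned into a rank/frequency table. This is only a sketch: the function name rank_frequency and the sample text are made up, and lowercasing the words is an assumption you may not want.

```python
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, count) tuples for the purely alphabetic words."""
    # Assumption: case is ignored, so "The" and "the" count as one word.
    words = [w.lower() for w in text.split() if w.isalpha()]
    counts = Counter(words)
    # most_common() sorts by descending count; rank 1 is the most frequent word.
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]

sample = "the boss said the reports go to the boss"
table = rank_frequency(sample)
for rank, word, count in table:
    print(rank, word, count)
```

For a real file, the text could be read with open('text.txt').read() as in the earlier answer, and the counts plotted against rank on log-log axes.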
The nltk package is specialised in handling text and has various functions you can use to 'tokenize' text into words.
You can either use the RegexpTokenizer, or the word_tokenize with a slight adaptation.
The easiest and simplest is the RegexpTokenizer:
import nltk
text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."
result = nltk.RegexpTokenizer(r'\w+').tokenize(text)
Which returns:
`['asdf', 'gmail', 'com', 'said', 'I', 've', 'taken', '2', 'reports', 'to', 'the', 'boss', 'I', 'didn', 't', 'do', 'the', 'other', 'things']`
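If installing nltk is not an option, the same tokens can be obtained with the standard library alone. As far as I can tell, RegexpTokenizer with its default settings simply collects every match of the pattern, which is what re.findall does; treat that equivalence as an assumption rather than a guarantee. The sketch also adds an isalpha() filter to drop the purely numeric token:

```python
import re

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

# Every maximal run of word characters becomes a token.
tokens = re.findall(r'\w+', text)

# Drop tokens that contain digits or underscores, such as '2'.
alpha_tokens = [t for t in tokens if t.isalpha()]

print(tokens)
print(alpha_tokens)
```

Note that this splits words at punctuation instead of discarding them, so fragments such as 'asdf', 've' and 't' survive; it does not remove whole words the way the question asks.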
Or you can use the slightly smarter word_tokenize, which is able to split most contractions like didn't into did and n't.
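import re
import nltk
# You only have to do this once (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')
def contains_letters(phrase):
    # True if the token contains at least one ASCII letter
    return bool(re.search('[a-zA-Z]', phrase))
text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."
# Keep tokens that contain a letter; this drops '2' and pure punctuation
result = [word for word in nltk.word_tokenize(text) if contains_letters(word)]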
which returns:
['asdf', 'gmail.com', 'said', 'I', "'ve", 'taken', 'reports', 'to', 'the', 'boss', 'I', 'did', "n't", 'do', 'the', 'other', 'things']