How to remove every word with non-alphabetic characters

Tags:

python

python-3.x

python-2.7

grammar

I need to write a Python script that removes every word containing non-alphabetic characters from a text file, in order to test Zipf's law. For example:

asdf@gmail.com said: I've taken 2 reports to the boss

to

taken reports to the boss

How should I proceed?

275

asked Sep 29 '17 09:09

Norhther

4 Answers

Using regular expressions to match only letters (and underscores), you can do this:

import re

s = "asdf@gmail.com said: I've taken 2 reports to the boss"
# s = open('text.txt').read()

tokens = s.strip().split()
clean_tokens = [t for t in tokens if re.match(r'[^\W\d]*$', t)]
# ['taken', 'reports', 'to', 'the', 'boss']
clean_s = ' '.join(clean_tokens)
# 'taken reports to the boss'
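Since the pattern above also accepts underscores, here is a minimal sketch of a stricter variant, assuming you want letters only. The names `LETTERS_ONLY`, `LETTERS_OR_UNDERSCORE`, and `keep_alpha_tokens`, and the extra `snake_case` token in the sample, are made up for illustration:

```python
import re

# Pattern from the answer above: letters and underscores, anchored at the end.
LETTERS_OR_UNDERSCORE = re.compile(r'[^\W\d]*$')
# Stricter variant (an assumption about what is wanted): letters only, no underscore.
LETTERS_ONLY = re.compile(r'[^\W\d_]+')

def keep_alpha_tokens(text, pattern=LETTERS_ONLY):
    """Return the whitespace-separated tokens that the pattern matches in full."""
    return [t for t in text.split() if pattern.fullmatch(t)]

sample = "asdf@gmail.com said: I've taken 2 snake_case reports to the boss"

strict = keep_alpha_tokens(sample)
loose = [t for t in sample.split() if LETTERS_OR_UNDERSCORE.match(t)]

print(strict)  # 'snake_case' is dropped
print(loose)   # 'snake_case' is kept
```

Using `fullmatch` means the pattern needs no explicit `$` anchor.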

166

answered Oct 01 '22 01:10

user2390182

Try this:

sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
words = [word for word in sentence.split() if word.isalpha()]
# ['taken', 'reports', 'to', 'the', 'boss']

result = ' '.join(words)
# taken reports to the boss

answered Oct 01 '22 02:10

CtheSky

You can use split() and isalpha() to get a list of the words that contain only alphabetic characters. Note that isalpha() returns False for an empty string, so every word kept has at least one character.

>>> sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
>>> alpha_words = [word for word in sentence.split() if word.isalpha()]
>>> print(alpha_words)
['taken', 'reports', 'to', 'the', 'boss']

You can then use join() to make the list into one string:

>>> alpha_only_string = " ".join(alpha_words)
>>> print(alpha_only_string)
taken reports to the boss
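For the Zipf's law test mentioned in the question, the same split() and isalpha() filter can feed a frequency count. This is a rough sketch, not a definitive implementation: the function name `alpha_word_counts`, the second sample line, and the choice to lowercase words are my assumptions.

```python
from collections import Counter

def alpha_word_counts(lines):
    """Count purely alphabetic words across an iterable of lines.

    Words are lowercased before counting (an assumption, so that
    'The' and 'the' are treated as the same word).
    """
    counts = Counter()
    for line in lines:
        counts.update(word.lower() for word in line.split() if word.isalpha())
    return counts

lines = [
    "asdf@gmail.com said: I've taken 2 reports to the boss",
    "the boss read the reports",  # made-up second line for illustration
]
counts = alpha_word_counts(lines)

# (word, frequency) pairs sorted by descending frequency, ready to plot
# frequency against rank.
ranked = counts.most_common()
print(ranked)
```

Because the function only iterates over lines, an open file object can be passed in directly instead of a list, so the whole file never has to be held in memory.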

answered Oct 01 '22 02:10

Sash Sinha

The nltk package is specialised in handling text and has various functions you can use to 'tokenize' text into words.

You can use either RegexpTokenizer or word_tokenize with a slight adaptation.

The simplest option is RegexpTokenizer:

import nltk

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

result = nltk.RegexpTokenizer(r'\w+').tokenize(text)

Which returns:

['asdf', 'gmail', 'com', 'said', 'I', 've', 'taken', '2', 'reports', 'to', 'the', 'boss', 'I', 'didn', 't', 'do', 'the', 'other', 'things']

Or you can use the slightly smarter word_tokenize which is able to split most contractions like didn't into did and n't.

import re
import nltk
nltk.download('punkt')  # You only have to do this once

def contains_letters(phrase):
    return bool(re.search('[a-zA-Z]', phrase))

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

result = [word for word in nltk.word_tokenize(text) if contains_letters(word)]

which returns:

['asdf', 'gmail.com', 'said', 'I', "'ve", 'taken', 'reports', 'to', 'the', 'boss', 'I', 'did', "n't", 'do', 'the', 'other', 'things']
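The list above still contains tokens with non-letters, such as 'gmail.com', "'ve" and "n't". If those should go as well, one possible follow-up is to filter the tokens with str.isalpha(). The sketch below starts from the output shown above rather than calling nltk, so it runs without the package; `alpha_tokens` is just an illustrative name.

```python
# Token list copied from the word_tokenize output shown above.
tokens = ['asdf', 'gmail.com', 'said', 'I', "'ve", 'taken', 'reports', 'to',
          'the', 'boss', 'I', 'did', "n't", 'do', 'the', 'other', 'things']

# Keep only the tokens made up entirely of letters.
alpha_tokens = [t for t in tokens if t.isalpha()]
print(alpha_tokens)
```

Note that this is not quite what the question asks for: the tokenizer has already split 'asdf' off the e-mail address and 'said' off its colon, so both survive here, whereas filtering whitespace-separated words would drop them.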