
How to remove every word with non-alphabetic characters

I need to write a Python script that removes every word containing non-alphabetic characters from a text file, in order to test Zipf's law. For example:

asdf@gmail.com said: I've taken 2 reports to the boss

should become

taken reports to the boss

How should I proceed?

Norhther asked Sep 29 '17 09:09


4 Answers

Using regular expressions to match only letters (and underscores), you can do this:

import re

s = "asdf@gmail.com said: I've taken 2 reports to the boss"
# s = open('text.txt').read()

tokens = s.strip().split()
clean_tokens = [t for t in tokens if re.match(r'[^\W\d]*$', t)]
# ['taken', 'reports', 'to', 'the', 'boss']
clean_s = ' '.join(clean_tokens)
# 'taken reports to the boss'
user2390182 answered Oct 01 '22 01:10
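
Note that the pattern in the answer above also lets underscores through, so a token such as snake_case would survive. A minimal sketch of a stricter variant, assuming words should consist of letters only; the helper name keep_alpha_words and the sample sentence are made up for illustration and are not part of the original answer:

```python
import re

# \w minus digits and the underscore leaves letters only
# (Unicode-aware for str patterns in Python 3).
LETTERS_ONLY = re.compile(r'[^\W\d_]+')

def keep_alpha_words(text):
    """Return the whitespace-separated tokens of text that consist solely of letters."""
    # fullmatch() requires the whole token to match, so no trailing '$' is needed.
    return [t for t in text.split() if LETTERS_ONLY.fullmatch(t)]

print(keep_alpha_words("snake_case said: I've taken 2 reports to the boss"))
# ['taken', 'reports', 'to', 'the', 'boss']
```

Using + instead of * also rules out empty matches, although split() never yields empty tokens anyway.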


Try this:

sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
words = [word for word in sentence.split() if word.isalpha()]
# ['taken', 'reports', 'to', 'the', 'boss']

result = ' '.join(words)
# taken reports to the boss
CtheSky answered Oct 01 '22 02:10
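
One caveat with isalpha(): in Python 3 it is Unicode-aware, so accented words such as "café" count as alphabetic. If only the letters A-Z and a-z should pass, something like the following may work. It assumes Python 3.7+ for str.isascii(), and ascii_alpha_words is a hypothetical helper name, not something from the answer above:

```python
def ascii_alpha_words(text):
    """Keep only tokens made entirely of ASCII letters (A-Z, a-z)."""
    # isascii() rejects accented or non-Latin letters; isalpha() rejects
    # digits, punctuation and empty strings.
    return [w for w in text.split() if w.isascii() and w.isalpha()]

print(ascii_alpha_words("café is not naïve but plain is"))
# ['is', 'not', 'but', 'plain', 'is']
```

Whether accented words should be kept depends on the language of the corpus, so for non-English text the plain isalpha() filter is probably the better default.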


You can use split() and isalpha() to get a list of the words that contain only alphabetic characters. isalpha() also requires at least one character, so empty strings are excluded.

>>> sentence = "asdf@gmail.com said: I've taken 2 reports to the boss"
>>> alpha_words = [word for word in sentence.split() if word.isalpha()]
>>> print(alpha_words)
['taken', 'reports', 'to', 'the', 'boss']

You can then use join() to make the list into one string:

>>> alpha_only_string = " ".join(alpha_words)
>>> print(alpha_only_string)
taken reports to the boss
Sash Sinha answered Oct 01 '22 02:10
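
Since the stated goal is to test Zipf's law, the filtered words still have to be counted and ranked. A rough sketch of that step, assuming case should be ignored; count_alpha_words and the sample lines are invented for illustration:

```python
from collections import Counter

def count_alpha_words(lines):
    """Count purely alphabetic words (lower-cased) across an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(w.lower() for w in line.split() if w.isalpha())
    return counts

sample = [
    "the boss said: the reports are late",
    "I've taken 2 reports to the boss",
]
counts = count_alpha_words(sample)

# Zipf's law predicts that rank * frequency stays roughly constant
# for the most common words of a large corpus.
for rank, (word, freq) in enumerate(counts.most_common(5), start=1):
    print(rank, word, freq, rank * freq)
```

For a real file, an open file object can be passed directly, e.g. count_alpha_words(open('text.txt', encoding='utf-8')), which reads line by line instead of loading the whole text at once.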


The nltk package specialises in text processing and has various functions you can use to 'tokenize' text into words.

You can use either RegexpTokenizer or word_tokenize with a slight adaptation.

The simplest is RegexpTokenizer:

import nltk

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

result = nltk.RegexpTokenizer(r'\w+').tokenize(text)

Which returns:

['asdf', 'gmail', 'com', 'said', 'I', 've', 'taken', '2', 'reports', 'to', 'the', 'boss', 'I', 'didn', 't', 'do', 'the', 'other', 'things']

Or you can use the slightly smarter word_tokenize, which can split most contractions, e.g. "didn't" into "did" and "n't".

import re
import nltk
nltk.download('punkt')  # Only needed once; recent NLTK releases may ask for 'punkt_tab' instead

def contains_letters(phrase):
    return bool(re.search('[a-zA-Z]', phrase))

text = "asdf@gmail.com said: I've taken 2 reports to the boss. I didn't do the other things."

result = [word for word in nltk.word_tokenize(text) if contains_letters(word)]

which returns:

['asdf', 'gmail.com', 'said', 'I', "'ve", 'taken', 'reports', 'to', 'the', 'boss', 'I', 'did', "n't", 'do', 'the', 'other', 'things']
Swier answered Oct 01 '22 01:10
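
Both tokenizers split on punctuation, so fragments of mixed tokens (such as "ve" or "gmail") end up in the result rather than the whole token being dropped. If the aim is to drop such tokens entirely while not losing a word just because a full stop follows it, a standard-library sketch might look like this. It departs from the question's example in that "said:" is kept as "said"; the function name and the sample text are illustrative assumptions:

```python
import string

def alpha_words_ignoring_edge_punctuation(text):
    """Strip leading/trailing punctuation from each token, then keep alphabetic ones."""
    words = []
    for token in text.split():
        core = token.strip(string.punctuation)  # "boss." -> "boss", "I've" unchanged
        if core.isalpha():                       # drops "I've", "2", and empty strings
            words.append(core)
    return words

text = "The boss said: I've taken 2 reports. I didn't do the other things."
print(alpha_words_ignoring_edge_punctuation(text))
# ['The', 'boss', 'said', 'taken', 'reports', 'I', 'do', 'the', 'other', 'things']
```

Only punctuation at the edges of a token is removed, so tokens with an apostrophe or "@" inside them are still discarded as a whole.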