How can I preprocess NLP text (lowercase, remove special characters, remove numbers, remove emails, etc) in one pass using Python?
Here are all the things I want to do to a Pandas dataframe in one pass in Python:
1. Lowercase text
2. Remove whitespace
3. Remove numbers
4. Remove special characters
5. Remove emails
6. Remove stop words
7. Remove NAN
8. Remove weblinks
9. Expand contractions (if possible, but not necessary)
10. Tokenize
Here's how I am doing it all individually:
def preprocess(self, dataframe):
    self.log.info("In preprocess function.")
    dataframe1 = self.remove_nan(dataframe)
    dataframe2 = self.lowercase(dataframe1)
    dataframe3 = self.remove_whitespace(dataframe2)
    # Remove emails and websites before removing special characters
    dataframe4 = self.remove_emails(dataframe3)
    dataframe5 = self.remove_website_links(dataframe4)
    dataframe6 = self.remove_special_characters(dataframe5)
    dataframe7 = self.remove_numbers(dataframe6)
    dataframe8 = self.remove_stop_words(dataframe7)
    dataframe9 = self.tokenize(dataframe8)
    self.log.info(f"Sample of preprocessed data: {dataframe9.head()}")
    return dataframe9
def remove_nan(self, dataframe):
    """Pass in a dataframe to remove NAN from those columns."""
    return dataframe.dropna()

def lowercase(self, dataframe):
    self.log.info("Converting dataframe to lowercase")
    lowercase_dataframe = dataframe.str.lower()
    return lowercase_dataframe

def remove_special_characters(self, dataframe):
    self.log.info("Removing special characters from dataframe")
    no_special_characters = dataframe.replace(r'[^A-Za-z0-9 ]+', '', regex=True)
    return no_special_characters

def remove_numbers(self, dataframe):
    self.log.info("Removing numbers from dataframe")
    removed_numbers = dataframe.str.replace(r'\d+', '', regex=True)
    return removed_numbers

def remove_whitespace(self, dataframe):
    self.log.info("Removing whitespace from dataframe")
    # Replace more than one space with a single space
    merged_spaces = dataframe.str.replace(r"\s\s+", ' ', regex=True)
    # Delete leading and trailing spaces
    trimmed_spaces = merged_spaces.str.strip()
    return trimmed_spaces

def remove_stop_words(self, dataframe):
    # TODO: An option to pass in a custom list of stopwords would be cool.
    stop_words = set(stopwords.words('english'))
    return dataframe.apply(
        lambda text: " ".join(word for word in text.split() if word not in stop_words)
    )

def remove_website_links(self, dataframe):
    self.log.info("Removing website links from dataframe")
    no_website_links = dataframe.str.replace(r"http\S+", "", regex=True)
    return no_website_links

def tokenize(self, dataframe):
    tokenized_dataframe = dataframe.apply(lambda row: word_tokenize(row))
    return tokenized_dataframe

def remove_emails(self, dataframe):
    no_emails = dataframe.str.replace(r"\S*@\S*\s?", "", regex=True)
    return no_emails

def expand_contractions(self, dataframe):
    # TODO: Not a priority right now. Come back to it later.
    return dataframe
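To make the goal concrete, what I'm after is something like the sketch below: one cleaning function applied in a single pass over the column. This is untested and hand-wavy; the contraction step uses the third-party contractions package, which I haven't committed to, and the column names are placeholders.

import re
import contractions  # third-party: pip install contractions (optional step)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Assumes the NLTK stopwords and punkt data have already been downloaded
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    # Every string-level step happens here, so the Series is walked only once.
    text = contractions.fix(text)             # 9. expand contractions
    text = text.lower()                       # 1. lowercase
    text = re.sub(r"http\S+", "", text)       # 8. remove weblinks
    text = re.sub(r"\S*@\S*\s?", "", text)    # 5. remove emails
    text = re.sub(r"[^a-z ]+", "", text)      # 3/4. remove numbers and special characters
    text = re.sub(r"\s+", " ", text).strip()  # 2. normalize whitespace
    tokens = word_tokenize(text)              # 10. tokenize
    return [w for w in tokens if w not in STOP_WORDS]  # 6. remove stop words

def preprocess_one_pass(series):
    # 7. dropna removes NAN rows, then a single apply() does the rest.
    return series.dropna().apply(clean_text)

# Placeholder usage:
# df['tokens'] = preprocess_one_pass(df['text'])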
The following function does all the things you mentioned.
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords
import re

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the set once, not per word

def preprocess(sentence):
    sentence = str(sentence)
    sentence = sentence.lower()
    sentence = sentence.replace('{html}', "")
    # Strip HTML tags
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', sentence)
    # Remove URLs, then numbers
    rem_url = re.sub(r'http\S+', '', cleantext)
    rem_num = re.sub('[0-9]+', '', rem_url)
    # Tokenize on word characters, which also drops punctuation
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(rem_num)
    # Keep tokens longer than two characters that are not stop words
    filtered_words = [w for w in tokens if len(w) > 2 and w not in stop_words]
    stem_words = [stemmer.stem(w) for w in filtered_words]
    lemma_words = [lemmatizer.lemmatize(w) for w in stem_words]
    return " ".join(lemma_words)

df['cleanText'] = df['Text'].map(preprocess)
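One caveat: the NLTK resources used above are downloaded separately. If you haven't fetched them yet, a one-time setup looks like this:

import nltk
nltk.download('stopwords')  # word lists behind stopwords.words('english')
nltk.download('wordnet')    # lemma data behind WordNetLemmatizer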
I decided to use Dask, which allows you to parallelize Python tasks on your local computer and works well with Pandas, NumPy, and scikit-learn: http://docs.dask.org/en/latest/why.html
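A minimal sketch of that setup, assuming the preprocess function and df from above (the npartitions count here is an arbitrary choice):

import dask.dataframe as dd

# Split the pandas dataframe into chunks that Dask can clean in parallel
ddf = dd.from_pandas(df, npartitions=4)

# meta tells Dask the name and dtype of the column the map will produce
ddf['cleanText'] = ddf['Text'].map(preprocess, meta=('cleanText', 'object'))

# compute() runs the work and returns a regular pandas dataframe
df = ddf.compute()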