 

How can I preprocess NLP text (lowercase, remove special characters, remove numbers, remove emails, etc) in one pass?

Tags:

python

pandas

nlp


Here are all the things I want to do to a Pandas dataframe in one pass in Python:
1. Lowercase text
2. Remove whitespace
3. Remove numbers
4. Remove special characters
5. Remove emails
6. Remove stop words
7. Remove NAN
8. Remove weblinks
9. Expand contractions (if possible not necessary)
10. Tokenize

Here's how I am doing it all individually:

    def preprocess(self, dataframe):
        self.log.info("In preprocess function.")

        dataframe1 = self.remove_nan(dataframe)
        dataframe2 = self.lowercase(dataframe1)
        dataframe3 = self.remove_whitespace(dataframe2)

        # Remove emails and websites before removing special characters
        dataframe4 = self.remove_emails(dataframe3)
        dataframe5 = self.remove_website_links(dataframe4)

        dataframe6 = self.remove_special_characters(dataframe5)
        dataframe7 = self.remove_numbers(dataframe6)
        self.remove_stop_words(dataframe7)  # Doesn't return anything for now
        dataframe8 = self.tokenize(dataframe7)

        self.log.info(f"Sample of preprocessed data: {dataframe8.head()}")

        return dataframe8

    def remove_nan(self, dataframe):
        """Pass in a dataframe to remove NAN from those columns."""
        return dataframe.dropna()

    def lowercase(self, dataframe):
        self.log.info("Converting dataframe to lowercase")
        lowercase_dataframe = dataframe.apply(lambda x: x.lower())
        return lowercase_dataframe

    def remove_special_characters(self, dataframe):
        self.log.info("Removing special characters from dataframe")
        no_special_characters = dataframe.replace(r'[^A-Za-z0-9 ]+', '', regex=True)
        return no_special_characters

    def remove_numbers(self, dataframe):
        self.log.info("Removing numbers from dataframe")
        removed_numbers = dataframe.str.replace(r'\d+', '', regex=True)
        return removed_numbers

    def remove_whitespace(self, dataframe):
        self.log.info("Removing whitespace from dataframe")
        # replace more than 1 space with 1 space
        merged_spaces = dataframe.str.replace(r"\s\s+", ' ', regex=True)
        # delete beginning and trailing spaces
        trimmed_spaces = merged_spaces.str.strip()
        return trimmed_spaces

    def remove_stop_words(self, dataframe):
        # TODO: An option to pass in a custom list of stopwords would be cool.
        # Stub for now: builds the stopword set but doesn't filter anything yet.
        set(stopwords.words('english'))

    def remove_website_links(self, dataframe):
        self.log.info("Removing website links from dataframe")
        no_website_links = dataframe.str.replace(r"http\S+", "", regex=True)
        return no_website_links

    def tokenize(self, dataframe):
        tokenized_dataframe = dataframe.apply(lambda row: word_tokenize(row))
        return tokenized_dataframe

    def remove_emails(self, dataframe):
        no_emails = dataframe.str.replace(r"\S*@\S*\s?", "", regex=True)
        return no_emails

    def expand_contractions(self, dataframe):
        # TODO: Not a priority right now. Come back to it later.
        return dataframe
asked Jan 28 '19 by pr338




2 Answers

The following function performs all the things you mentioned:

    import nltk
    from nltk.tokenize import RegexpTokenizer
    from nltk.stem import WordNetLemmatizer, PorterStemmer
    from nltk.corpus import stopwords
    import re

    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words('english'))  # build once instead of on every call

    def preprocess(sentence):
        sentence = str(sentence).lower()
        sentence = sentence.replace('{html}', "")
        cleantext = re.sub(re.compile('<.*?>'), '', sentence)  # strip HTML tags
        rem_url = re.sub(r'http\S+', '', cleantext)            # remove weblinks
        rem_num = re.sub('[0-9]+', '', rem_url)                # remove numbers
        tokenizer = RegexpTokenizer(r'\w+')                    # tokenize, dropping punctuation
        tokens = tokenizer.tokenize(rem_num)
        filtered_words = [w for w in tokens if len(w) > 2 and w not in stop_words]
        stem_words = [stemmer.stem(w) for w in filtered_words]
        lemma_words = [lemmatizer.lemmatize(w) for w in stem_words]
        return " ".join(lemma_words)


    df['cleanText'] = df['Text'].map(preprocess)
answered Sep 30 '22 by Ravikiran


I decided to use Dask, which lets you parallelize Python tasks on your local machine and works well with Pandas, NumPy, and scikit-learn: http://docs.dask.org/en/latest/why.html

answered Sep 30 '22 by pr338