I added lemmatization to my CountVectorizer, as explained on this scikit-learn page.
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]

tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer,
                                strip_accents='unicode',
                                stop_words='english',
                                lowercase=True,
                                token_pattern=r'\b[a-zA-Z]{3,}\b',  # keeps words of 3 or more characters
                                max_df=0.5,
                                min_df=10)
However, when creating a dtm (document-term matrix) using fit_transform, I get the error below, which I can't make sense of. Before adding the lemmatization to my vectorizer, the dtm code always worked. I dug deeper into the manual and tried a few things in the code, but couldn't find a solution.
dtm_tf = tf_vectorizer.fit_transform(articles)
Update:
After following @MaxU's advice below, the code ran without errors; however, numbers and punctuation were not omitted from my output. I ran individual tests to see which of the other arguments passed alongside LemmaTokenizer() do and do not work. Here is the result:
strip_accents = 'unicode', # works
stop_words = 'english', # works
lowercase = True, # works
token_pattern = r'\b[a-zA-Z]{3,}\b', # does not work
max_df = 0.5, # works
min_df = 10 # works
Apparently, it is just token_pattern that becomes inactive: when a custom tokenizer is supplied, CountVectorizer ignores token_pattern. Here is the updated and working code without token_pattern (I just needed to download the 'punkt' and 'wordnet' NLTK packages first):
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]

tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer(),
                                strip_accents='unicode',  # works
                                stop_words='english',     # works
                                lowercase=True,           # works
                                max_df=0.5,               # works
                                min_df=10)                # works
For those who want to remove digits, punctuation and words of fewer than 3 characters (but have no idea how), here is one way that does it for me when working from a Pandas DataFrame:
# when working from a Pandas DataFrame
df['TEXT'] = df['TEXT'].str.replace(r'\d+', '', regex=True)          # remove digits
df['TEXT'] = df['TEXT'].str.replace(r'\b\w{1,2}\b', '', regex=True)  # remove words of 1-2 characters
df['TEXT'] = df['TEXT'].str.replace(r'[^\w\s]', '', regex=True)      # remove punctuation
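Alternatively, since token_pattern is ignored once a custom tokenizer is supplied, the same filtering could be done inside the tokenizer itself. This is only a sketch of that idea (the class name FilteringLemmaTokenizer and the 3-character cut-off are illustrative, not from the original post):
import re
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

class FilteringLemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        tokens = [self.wnl.lemmatize(t) for t in word_tokenize(articles)]
        # emulate token_pattern=r'\b[a-zA-Z]{3,}\b' inside the tokenizer
        return [t for t in tokens if re.fullmatch(r'[a-zA-Z]{3,}', t)]

tf_vectorizer = CountVectorizer(tokenizer=FilteringLemmaTokenizer(),
                                strip_accents='unicode',
                                stop_words='english',
                                lowercase=True,
                                max_df=0.5,
                                min_df=10)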
CountVectorizer tokenizes the data and splits it into chunks called n-grams, whose length we can define by passing a tuple to the ngram_range argument. For example, (1, 1) would give us unigrams (1-grams) such as “whey” and “protein”, while (2, 2) would give us bigrams (2-grams) such as “whey protein”.
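For instance, with a couple of toy documents (not data from the question), the effect of ngram_range can be seen directly; get_feature_names_out requires scikit-learn >= 1.0:
from sklearn.feature_extraction.text import CountVectorizer

docs = ["whey protein is popular", "whey protein shakes"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(docs)

print(unigrams.get_feature_names_out())  # ['is' 'popular' 'protein' 'shakes' 'whey']
print(bigrams.get_feature_names_out())   # ['is popular' 'protein is' 'protein shakes' 'whey protein']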
In order to lemmatize, you need to create an instance of WordNetLemmatizer() and call its lemmatize() function on a single word. Let's lemmatize a simple sentence: we first tokenize it into words using nltk.word_tokenize and then call the lemmatizer on each token.
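For example (the sentence is made up for illustration; the 'punkt' and 'wordnet' NLTK data need to be downloaded first, and newer NLTK releases may also ask for 'punkt_tab'):
import nltk
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')    # tokenizer models
nltk.download('wordnet')  # lemmatizer data

lemmatizer = WordNetLemmatizer()
sentence = "The cats were chasing mice in the gardens"
tokens = word_tokenize(sentence)
print([lemmatizer.lemmatize(t) for t in tokens])
# ['The', 'cat', 'were', 'chasing', 'mouse', 'in', 'the', 'garden']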
Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. It is similar to stemming, but it brings context to the words: it maps related forms back to one dictionary word (the lemma).
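A quick contrast with stemming (PorterStemmer is used here purely for illustration):
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))                  # 'studi' -- crude suffix stripping
print(lemmatizer.lemmatize("studies"))          # 'study' -- a real dictionary word
print(lemmatizer.lemmatize("better", pos="a"))  # 'good'  -- context via the part-of-speech tag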
It should be:
tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer(),
# NOTE: ----------------------> ^^
instead of:
tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer,
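Passing the class itself makes CountVectorizer call LemmaTokenizer(doc) on each document, which fails because __init__ takes no arguments; an instance is callable through __call__ and works. A minimal sketch with made-up documents (min_df/max_df are left at their defaults so the toy corpus isn't filtered away):
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]

articles = ["The cats sat on the mats", "Dogs were chasing the cats"]

tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer())  # instance, not class
dtm_tf = tf_vectorizer.fit_transform(articles)
print(tf_vectorizer.get_feature_names_out())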