Sklearn: adding lemmatizer to CountVectorizer

Tags:

python

scikit-learn

lemmatization

countvectorizer

I added lemmatization to my CountVectorizer, as explained on this Sklearn page.

from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]

tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer,
                       strip_accents = 'unicode',
                       stop_words = 'english',
                       lowercase = True,
                       token_pattern = r'\b[a-zA-Z]{3,}\b', # keeps words of 3 or more characters
                       max_df = 0.5,
                       min_df = 10)

However, when creating a DTM using fit_transform, I get an error that I can't make sense of. Before adding the lemmatization to my vectorizer, the DTM code always worked. I went deeper into the manual and tried some things with the code, but couldn't find a solution.

dtm_tf = tf_vectorizer.fit_transform(articles)

Update:

After following @MaxU's advice below, the code ran without error; however, numbers and punctuation were not omitted from my output. I ran individual tests to see which of the other parameters after LemmaTokenizer() do and do not work. Here is the result:

strip_accents = 'unicode', # works
stop_words = 'english', # works
lowercase = True, # works
token_pattern = r'\b[a-zA-Z]{3,}\b', # does not work
max_df = 0.5, # works
min_df = 10 # works

Apparently, it is just token_pattern that became inactive. Here is the updated and working code without token_pattern (I just needed to install the 'punkt' and 'wordnet' packages first):

# one-time setup: nltk.download('punkt') and nltk.download('wordnet')
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]

tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer(),
                                strip_accents = 'unicode', # works
                                stop_words = 'english', # works
                                lowercase = True, # works
                                max_df = 0.5, # works
                                min_df = 10) # works
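
The inactive token_pattern is expected behaviour: CountVectorizer only applies token_pattern when no custom tokenizer is supplied, so once tokenizer=LemmaTokenizer() is set, the regex is ignored. One way to get the filtering back is to apply the same regex inside the tokenizer itself. The following is a minimal sketch rather than a definitive implementation: FilteringLemmaTokenizer and toy_lemmatize are hypothetical names, and the toy lemmatizer only stands in for WordNetLemmatizer().lemmatize so that the snippet runs without NLTK data.

```python
import re


class FilteringLemmaTokenizer:
    """Callable tokenizer: regex-select tokens, then lemmatize each one."""

    def __init__(self, lemmatize, pattern=r"\b[a-zA-Z]{3,}\b"):
        # `lemmatize` is any str -> str function, e.g. WordNetLemmatizer().lemmatize
        self.lemmatize = lemmatize
        self.pattern = re.compile(pattern)

    def __call__(self, doc):
        # The regex does the job token_pattern would do if no tokenizer were set
        return [self.lemmatize(token) for token in self.pattern.findall(doc)]


def toy_lemmatize(token):
    # Hypothetical stand-in for a real lemmatizer: strips a trailing "s"
    return token[:-1] if token.endswith("s") else token


tokenizer = FilteringLemmaTokenizer(toy_lemmatize)
tokens = tokenizer("the 3 cats sat on 2 mats, ok?")
print(tokens)
```

With NLTK installed, passing tokenizer=FilteringLemmaTokenizer(WordNetLemmatizer().lemmatize) to CountVectorizer should drop digits, punctuation and tokens shorter than 3 letters before lemmatizing.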

For those who want to remove digits, punctuation and words of fewer than 3 characters (but have no idea how), here is one way that works for me when working from a Pandas DataFrame:

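# when working from a Pandas DataFrame
# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
df['TEXT'] = df['TEXT'].str.replace(r'\d+', '', regex=True) # for digits
df['TEXT'] = df['TEXT'].str.replace(r'(\b\w{1,2}\b)', '', regex=True) # for words
df['TEXT'] = df['TEXT'].str.replace(r'[^\w\s]', '', regex=True) # for punctuation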
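
The same cleaning can also be written as one plain function and applied per row, for example with df['TEXT'].map(clean_text). This is a sketch under assumptions: clean_text is a hypothetical helper, and unlike the snippet above it removes punctuation before short words, so "it's" becomes "its" and is kept instead of being split into two short fragments. Swap the steps if you prefer the original order.

```python
import re


def clean_text(text):
    """Remove digits, punctuation and 1-2 character words from a string."""
    text = re.sub(r"\d+", "", text)           # digits
    text = re.sub(r"[^\w\s]", "", text)       # punctuation
    text = re.sub(r"\b\w{1,2}\b", "", text)   # words of 1 or 2 characters
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace


cleaned = clean_text("In 2017, it's 3 cats & 2 dogs!")
print(cleaned)
```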

asked Nov 21 '17 22:11

Rens

1 Answer

It should be:

tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer(),
# NOTE:                        ---------------------->  ^^

instead of:

tf_vectorizer = CountVectorizer(tokenizer=LemmaTokenizer,
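
The parentheses matter because the vectorizer calls its tokenizer as tokenizer(doc). With an instance, that call goes to __call__; with the bare class, it tries to construct a new object with the document as an argument, which __init__ does not accept. A minimal sketch with hypothetical names (WhitespaceTokenizer and apply_tokenizer stand in for LemmaTokenizer and the vectorizer's internal call; the question does not show its traceback, so the exact message there may differ):

```python
class WhitespaceTokenizer:
    """Toy tokenizer with the same shape as LemmaTokenizer."""

    def __init__(self):
        self.sep = None  # None makes str.split split on any whitespace

    def __call__(self, doc):
        return doc.split(self.sep)


def apply_tokenizer(tokenizer, doc):
    # Stand-in for how a vectorizer invokes its tokenizer on each document
    return tokenizer(doc)


# Passing an instance: the call is routed to __call__ and returns tokens
tokens = apply_tokenizer(WhitespaceTokenizer(), "a b c")

# Passing the class: the call becomes WhitespaceTokenizer("a b c"),
# which fails because __init__ takes no document argument
error = None
try:
    apply_tokenizer(WhitespaceTokenizer, "a b c")
except TypeError as exc:
    error = str(exc)

print(tokens)
print(error)
```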

answered Sep 17 '22 14:09

MaxU - stop WAR against UA
