I wrote a lemma tokenizer using spaCy for scikit-learn, based on their example; it works fine standalone:
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

class LemmaTokenizer(object):
    def __init__(self):
        self.spacynlp = spacy.load('en')

    def __call__(self, doc):
        nlpdoc = self.spacynlp(doc)
        # keep lemmas longer than one character, plus single alphanumeric characters
        nlpdoc = [token.lemma_ for token in nlpdoc if (len(token.lemma_) > 1) or (token.lemma_.isalnum())]
        return nlpdoc

vect = TfidfVectorizer(tokenizer=LemmaTokenizer())
vect.fit(['Apples and oranges are tasty.'])
print(vect.vocabulary_)
### prints {'apple': 1, 'and': 0, 'tasty': 4, 'be': 2, 'orange': 3}

Here vocabulary_ is a dict whose keys are the terms and whose values are their column indices in the feature matrix.
However, using it in GridSearchCV gives errors; a self-contained example is below:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
wordvect = TfidfVectorizer(analyzer='word', strip_accents='ascii', tokenizer=LemmaTokenizer())
classifier = OneVsRestClassifier(SVC(kernel='linear'))
pipeline = Pipeline([('vect', wordvect), ('classifier', classifier)])
parameters = {'vect__min_df': [1, 2], 'vect__max_df': [0.7, 0.8], 'classifier__estimator__C': [0.1, 1, 10]}
gs_clf = GridSearchCV(pipeline, parameters, n_jobs=7, verbose=1)
from sklearn.datasets import fetch_20newsgroups
categories = ['comp.graphics', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'), shuffle=True, categories=categories)
X = newsgroups.data
y = newsgroups.target
gs_clf = gs_clf.fit(X, y)
### AttributeError: 'spacy.tokenizer.Tokenizer' object has no attribute '_prefix_re'
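My guess is that this is a pickling failure: with n_jobs > 1, GridSearchCV serialises the pipeline (including the tokenizer and the spaCy pipeline stored on it) to send it to the worker processes, and this version of spaCy apparently cannot be pickled. A minimal check, assuming the same spaCy and scikit-learn versions as above:

import pickle

# GridSearchCV with n_jobs > 1 pickles the whole pipeline for its workers;
# pickling the tokenizer alone reproduces the failure, because the spaCy
# pipeline in self.spacynlp is serialised along with the instance.
pickle.dumps(LemmaTokenizer())  # fails with a similar error here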
The error does not appear when I load spaCy outside the constructor of the tokenizer; then the GridSearchCV runs:
spacynlp = spacy.load('en')

class LemmaTokenizer(object):
    def __call__(self, doc):
        nlpdoc = spacynlp(doc)
        nlpdoc = [token.lemma_ for token in nlpdoc if (len(token.lemma_) > 1) or (token.lemma_.isalnum())]
        return nlpdoc
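Pickling this version succeeds, which I assume is why the grid search now runs: the instance no longer carries the spaCy pipeline in its __dict__, so pickle only records a reference to the class, and spacynlp stays behind as a module-level global:

import pickle

# No spaCy object is stored on the instance, so this now works:
pickle.dumps(LemmaTokenizer())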
But this means that each of my n_jobs in the GridSearchCV will access and call the same spacynlp object; it is shared among these jobs, which leaves the question: is the object returned by spacy.load('en') safe to be used by multiple jobs in GridSearchCV?
You are wasting time by running spaCy for each parameter setting in the grid. The memory overhead is also significant. You should run all the data through spaCy once and save it to disk, then use a simplified vectoriser that reads in the pre-lemmatised data. Look at the tokenizer, analyzer and preprocessor parameters of TfidfVectorizer. There are plenty of examples on Stack Overflow that show how to build a custom vectoriser.
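For example, a minimal sketch of that approach (the helper lemmatise_corpus is illustrative, and it assumes the same spacy.load('en') model as in the question):

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

spacynlp = spacy.load('en')

def lemmatise_corpus(docs):
    # Run spaCy exactly once per document and join the lemmas back into
    # plain strings that a trivial tokenizer can handle.
    return [' '.join(token.lemma_ for token in spacynlp(doc)) for doc in docs]

X_lemmas = lemmatise_corpus(X)  # compute once; ideally save this to disk

# str.split is cheap, so the grid search no longer re-runs spaCy for
# every parameter combination and every cross-validation fold.
wordvect = TfidfVectorizer(analyzer='word', tokenizer=str.split)

With the expensive lemmatisation hoisted out of the pipeline, the tokenizer also no longer holds a spaCy object, so pickling for n_jobs > 1 stops being a problem.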