spaCy and scikit-learn vectorizer

I wrote a lemma tokenizer using spaCy for scikit-learn, based on their example; it works fine standalone:

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

class LemmaTokenizer(object):
    def __init__(self):
        self.spacynlp = spacy.load('en')
    def __call__(self, doc):
        nlpdoc = self.spacynlp(doc)
        nlpdoc = [token.lemma_ for token in nlpdoc if (len(token.lemma_) > 1) or (token.lemma_.isalnum()) ]
        return nlpdoc

vect = TfidfVectorizer(tokenizer=LemmaTokenizer())
vect.fit(['Apples and oranges are tasty.'])
print(vect.vocabulary_)
### prints {'apple': 1, 'and': 0, 'tasty': 4, 'be': 2, 'orange': 3}

However, using it in GridSearchCV gives errors; a self-contained example is below:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20

wordvect = TfidfVectorizer(analyzer='word', strip_accents='ascii', tokenizer=LemmaTokenizer())
classifier = OneVsRestClassifier(SVC(kernel='linear'))
pipeline = Pipeline([('vect', wordvect), ('classifier', classifier)])
parameters = {'vect__min_df': [1, 2], 'vect__max_df': [0.7, 0.8], 'classifier__estimator__C': [0.1, 1, 10]}
gs_clf = GridSearchCV(pipeline, parameters, n_jobs=7, verbose=1)

from sklearn.datasets import fetch_20newsgroups
categories = ['comp.graphics', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'), shuffle=True, categories=categories)
X = newsgroups.data
y = newsgroups.target
gs_clf = gs_clf.fit(X, y)

### AttributeError: 'spacy.tokenizer.Tokenizer' object has no attribute '_prefix_re'

The error does not appear when I load spaCy outside the tokenizer's constructor; then the GridSearchCV runs:

spacynlp = spacy.load('en')

class LemmaTokenizer(object):
    def __call__(self, doc):
        nlpdoc = spacynlp(doc)
        nlpdoc = [token.lemma_ for token in nlpdoc if (len(token.lemma_) > 1) or (token.lemma_.isalnum()) ]
        return nlpdoc

But this means that every one of the n_jobs in GridSearchCV will access and call the same spacynlp object; it is shared among these jobs, which leaves the questions:

  1. Is the spacynlp object from spacy.load('en') safe to be used by multiple jobs in GridSearchCV?
  2. Is this the correct way to implement calls to spacy inside a tokenizer for scikit-learn?
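One way to sidestep the shared-object concern is to defer spacy.load until the tokenizer is first called, so the instance still pickles cleanly when GridSearchCV ships the pipeline to worker processes, and each worker then loads its own model. This is a sketch (the class name is my own, and it is not tested against the original error):

```python
import pickle

class LazyLemmaTokenizer(object):
    """Sketch: load the spaCy model lazily on first call, so an un-called
    instance contains nothing heavyweight and round-trips through pickle."""
    def __init__(self):
        self.spacynlp = None  # deferred; nothing unpicklable stored yet

    def __call__(self, doc):
        if self.spacynlp is None:
            import spacy  # imported lazily so pickling never touches spaCy
            self.spacynlp = spacy.load('en')
        return [token.lemma_ for token in self.spacynlp(doc)
                if len(token.lemma_) > 1 or token.lemma_.isalnum()]

# An un-called instance survives the pickle round-trip that parallel
# GridSearchCV performs; the model loads later, once per worker.
clone = pickle.loads(pickle.dumps(LazyLemmaTokenizer()))
print(clone.spacynlp is None)  # True
```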
asked Jul 19 '17 by tkja


1 Answer

You are wasting time by running spaCy for each parameter setting in the grid, and the memory overhead is also significant. You should run all your data through spaCy once and save the lemmatised output to disk, then use a simplified vectoriser that reads in the pre-lemmatised data. Look at the tokenizer, analyzer and preprocessor parameters of TfidfVectorizer. There are plenty of examples on Stack Overflow that show how to build a custom vectoriser.
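A minimal sketch of that approach, where the toy documents stand in for a one-off spaCy pass that has already been run and saved to disk (the `identity` helper and the example data are my own):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Pretend these token lists were produced once by spaCy and loaded from disk.
pre_lemmatised = [
    ['apple', 'and', 'orange', 'be', 'tasty'],
    ['apple', 'pie', 'be', 'tasty'],
]

def identity(tokens):
    # Input is already a list of lemmas, so tokenisation and
    # preprocessing become no-ops.
    return tokens

vect = TfidfVectorizer(tokenizer=identity, preprocessor=identity, lowercase=False)
X = vect.fit_transform(pre_lemmatised)
print(sorted(vect.vocabulary_))
# ['and', 'apple', 'be', 'orange', 'pie', 'tasty']
```

With this setup, the expensive spaCy pass happens exactly once, while the grid search only re-fits the cheap TF-IDF step for each parameter combination.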

answered Oct 14 '22 by mbatchkarov