I wrote a lemma tokenizer using spaCy for scikit-learn, based on their example; it works fine standalone:
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

class LemmaTokenizer(object):
    def __init__(self):
        self.spacynlp = spacy.load('en')

    def __call__(self, doc):
        nlpdoc = self.spacynlp(doc)
        # keep lemmas longer than one character, plus single alphanumeric characters
        nlpdoc = [token.lemma_ for token in nlpdoc if (len(token.lemma_) > 1) or (token.lemma_.isalnum())]
        return nlpdoc

vect = TfidfVectorizer(tokenizer=LemmaTokenizer())
vect.fit(['Apples and oranges are tasty.'])
print(vect.vocabulary_)
### prints {'apple': 1, 'and': 0, 'tasty': 4, 'be': 2, 'orange': 3}

Here vocabulary_ is a dict whose keys are the terms and whose values are their column indices in the feature matrix.
However, using it in GridSearchCV gives errors; a self-contained example is below:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
wordvect = TfidfVectorizer(analyzer='word', strip_accents='ascii', tokenizer=LemmaTokenizer())
classifier = OneVsRestClassifier(SVC(kernel='linear'))
pipeline = Pipeline([('vect', wordvect), ('classifier', classifier)])
parameters = {'vect__min_df': [1, 2], 'vect__max_df': [0.7, 0.8], 'classifier__estimator__C': [0.1, 1, 10]}
gs_clf = GridSearchCV(pipeline, parameters, n_jobs=7, verbose=1)
from sklearn.datasets import fetch_20newsgroups
categories = ['comp.graphics', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'), shuffle=True, categories=categories)
X = newsgroups.data
y = newsgroups.target
gs_clf = gs_clf.fit(X, y)
### AttributeError: 'spacy.tokenizer.Tokenizer' object has no attribute '_prefix_re'
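My guess is that this is a pickling failure: with n_jobs > 1, GridSearchCV serialises the pipeline (including the tokenizer and the spaCy pipeline stored on it) to send it to the worker processes, and this version of spaCy apparently cannot be pickled. A minimal check, assuming the same spaCy and scikit-learn versions as above:

import pickle

# GridSearchCV with n_jobs > 1 pickles the whole pipeline for its workers;
# pickling the tokenizer alone reproduces the failure, because the spaCy
# pipeline in self.spacynlp is serialised along with the instance.
pickle.dumps(LemmaTokenizer())  # fails with a similar error here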
The error does not appear when I load spaCy outside the constructor of the tokenizer; then the GridSearchCV runs:
spacynlp = spacy.load('en')

class LemmaTokenizer(object):
    def __call__(self, doc):
        nlpdoc = spacynlp(doc)
        nlpdoc = [token.lemma_ for token in nlpdoc if (len(token.lemma_) > 1) or (token.lemma_.isalnum())]
        return nlpdoc
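Pickling this version succeeds, which I assume is why the grid search now runs: the instance no longer carries the spaCy pipeline in its __dict__, so pickle only records a reference to the class, and spacynlp stays behind as a module-level global:

import pickle

# No spaCy object is stored on the instance, so this now works:
pickle.dumps(LemmaTokenizer())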
But this means that each of my n_jobs in the GridSearchCV will access and call the same spacynlp object; it is shared among these jobs, which leaves the question: is the object returned by spacy.load('en') safe to be used by multiple jobs in GridSearchCV?
You are wasting time by running spaCy for each parameter setting in the grid. The memory overhead is also significant. You should run all the data through spaCy once and save it to disk, then use a simplified vectoriser that reads in the pre-lemmatised data. Look at the tokenizer, analyzer and preprocessor parameters of TfidfVectorizer. There are plenty of examples on Stack Overflow that show how to build a custom vectoriser.
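For example, a minimal sketch of that approach (the helper lemmatise_corpus is illustrative, and it assumes the same spacy.load('en') model as in the question):

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

spacynlp = spacy.load('en')

def lemmatise_corpus(docs):
    # Run spaCy exactly once per document and join the lemmas back into
    # plain strings that a trivial tokenizer can handle.
    return [' '.join(token.lemma_ for token in spacynlp(doc)) for doc in docs]

X_lemmas = lemmatise_corpus(X)  # compute once; ideally save this to disk

# str.split is cheap, so the grid search no longer re-runs spaCy for
# every parameter combination and every cross-validation fold.
wordvect = TfidfVectorizer(analyzer='word', tokenizer=str.split)

With the expensive lemmatisation hoisted out of the pipeline, the tokenizer also no longer holds a spaCy object, so pickling for n_jobs > 1 stops being a problem.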