I'm getting the following error when performing recursive feature selection with cross-validation:
Traceback (most recent call last):
File "/Users/.../srl/main.py", line 32, in <module>
argident_sys.train_classifier()
File "/Users/.../srl/identification.py", line 194, in train_classifier
feat_selector.fit(train_argcands_feats,train_argcands_target)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/feature_selection/rfe.py", line 298, in fit
ranking_ = rfe.fit(X[train], y[train]).ranking_
TypeError: only integer arrays with one element can be converted to an index
The code that generates the error is the following:
def train_classifier(self):
# Get the argument candidates
argcands = self.get_argcands(self.reader)
# Extract the necessary features from the argument candidates
train_argcands_feats = []
train_argcands_target = []
for argcand in argcands:
train_argcands_feats.append(self.extract_features(argcand))
if argcand["info"]["label"] == "NULL":
train_argcands_target.append("NULL")
else:
train_argcands_target.append("ARG")
# Transform the features to the format required by the classifier
self.feat_vectorizer = DictVectorizer()
train_argcands_feats = self.feat_vectorizer.fit_transform(train_argcands_feats)
# Transform the target labels to the format required by the classifier
self.target_names = list(set(train_argcands_target))
train_argcands_target = [self.target_names.index(target) for target in train_argcands_target]
## Train the appropriate supervised model
# Recursive Feature Elimination
self.classifier = LogisticRegression()
feat_selector = RFECV(estimator=self.classifier, step=1, cv=StratifiedKFold(train_argcands_target, 10))
feat_selector.fit(train_argcands_feats,train_argcands_target)
print feat_selector.n_features_
print feat_selector.support_
print feat_selector.ranking_
print feat_selector.cv_scores_
return
I know I should also perform GridSearch for the parameters of the LogisticRegression classifier, but I don't think that's the source of the error (or is it?).
I should mention that I'm testing with around 50 features, and almost all of them are categoric (that's why I use the DictVectorizer to transform them appropriately).
Any help or guidance you could give me is more than welcome. Thanks!
EDIT
Here's some training data examples:
train_argcands_feats = [{'head_lemma': u'Bras\xedlia', 'head': u'Bras\xedlia', 'head_postag': u'PROP'}, {'head_lemma': u'Pesquisa_Datafolha', 'head': u'Pesquisa_Datafolha', 'head_postag': u'N'}, {'head_lemma': u'dado', 'head': u'dado', 'head_postag': u'N'}, {'head_lemma': u'postura', 'head': u'postura', 'head_postag': u'N'}, {'head_lemma': u'maioria', 'head': u'maioria', 'head_postag': u'N'}, {'head_lemma': u'querer', 'head': u'quer', 'head_postag': u'V-FIN'}, {'head_lemma': u'PT', 'head': u'PT', 'head_postag': u'PROP'}, {'head_lemma': u'participar', 'head': u'participando', 'head_postag': u'V-GER'}, {'head_lemma': u'surpreendente', 'head': u'supreendente', 'head_postag': u'ADJ'}, {'head_lemma': u'Bras\xedlia', 'head': u'Bras\xedlia', 'head_postag': u'PROP'}, {'head_lemma': u'Pesquisa_Datafolha', 'head': u'Pesquisa_Datafolha', 'head_postag': u'N'}, {'head_lemma': u'revelar', 'head': u'revela', 'head_postag': u'V-FIN'}, {'head_lemma': u'recusar', 'head': u'recusando', 'head_postag': u'V-GER'}, {'head_lemma': u'maioria', 'head': u'maioria', 'head_postag': u'N'}, {'head_lemma': u'PT', 'head': u'PT', 'head_postag': u'PROP'}, {'head_lemma': u'participar', 'head': u'participando', 'head_postag': u'V-GER'}, {'head_lemma': u'surpreendente', 'head': u'supreendente', 'head_postag': u'ADJ'}, {'head_lemma': u'Bras\xedlia', 'head': u'Bras\xedlia', 'head_postag': u'PROP'}, {'head_lemma': u'Pesquisa_Datafolha', 'head': u'Pesquisa_Datafolha', 'head_postag': u'N'}, {'head_lemma': u'revelar', 'head': u'revela', 'head_postag': u'V-FIN'}, {'head_lemma': u'governo', 'head': u'Governo', 'head_postag': u'N'}, {'head_lemma': u'de', 'head': u'de', 'head_postag': u'PRP'}, {'head_lemma': u'governo', 'head': u'Governo', 'head_postag': u'N'}, {'head_lemma': u'recusar', 'head': u'recusando', 'head_postag': u'V-GER'}, {'head_lemma': u'maioria', 'head': u'maioria', 'head_postag': u'N'}, {'head_lemma': u'querer', 'head': u'quer', 'head_postag': u'V-FIN'}, {'head_lemma': u'PT', 'head': u'PT', 'head_postag': u'PROP'}, {'head_lemma': u'surpreendente', 'head': u'supreendente', 'head_postag': u'ADJ'}, {'head_lemma': u'Bras\xedlia', 'head': u'Bras\xedlia', 'head_postag': u'PROP'}, {'head_lemma': u'Pesquisa_Datafolha', 'head': u'Pesquisa_Datafolha', 'head_postag': u'N'}, {'head_lemma': u'revelar', 'head': u'revela', 'head_postag': u'V-FIN'}, {'head_lemma': u'muito', 'head': u'Muitas', 'head_postag': u'PRON-DET'}, {'head_lemma': u'prioridade', 'head': u'prioridades', 'head_postag': u'N'}, {'head_lemma': u'com', 'head': u'com', 'head_postag': u'PRP'}, {'head_lemma': u'prioridade', 'head': u'prioridades', 'head_postag': u'N'}]
train_argcands_target = ['NULL', 'ARG', 'ARG', 'ARG', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'ARG', 'ARG', 'ARG', 'ARG', 'NULL', 'NULL', 'NULL', 'NULL', 'ARG', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'ARG', 'NULL', 'NULL', 'NULL', 'NULL', 'ARG', 'ARG', 'NULL', 'NULL']
Error encountered is: "TypeError: only integer scalar arrays can be converted to a scalar index". This error comes when we handle Numpy array in the wrong way using Numpy utility functions. The fix is to refer the documentation of the Numpy function and use the function correctly.
The error 'TypeError: only integer scalar arrays can be converted to a scalar index' occurs when you try to index an ordinary Python list with an array of values. If you want to index a list, you have to use single values.
I finally got to solve the problem. Two things had to be done:
Thanks to everyone who tried to help!
If anyone is still interested,
I used the CountVectorizer
on something very similar and it gave me the same error. I realized that the vectorizer gives me a COO sparse matrix which is basically a coordinate list. Elements in COO matrices can't be accessed through row indexes. It is best to convert it to a CSR matrix (Compressed Sparse Row) which indexes row-wise. The conversion can be done easily coo_matrix.tocsr()
. No other change is required, this worked for me.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With