Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

TypeError: only integer arrays with one element can be converted to an index

I'm getting the following error when performing recursive feature selection with cross-validation:

Traceback (most recent call last):
  File "/Users/.../srl/main.py", line 32, in <module>
    argident_sys.train_classifier()
  File "/Users/.../srl/identification.py", line 194, in train_classifier
    feat_selector.fit(train_argcands_feats,train_argcands_target)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/feature_selection/rfe.py", line 298, in fit
    ranking_ = rfe.fit(X[train], y[train]).ranking_
TypeError: only integer arrays with one element can be converted to an index

The code that generates the error is the following:

def train_classifier(self):

    # Get the argument candidates
    argcands = self.get_argcands(self.reader)

    # Extract the necessary features from the argument candidates
    train_argcands_feats = []
    train_argcands_target = []

    for argcand in argcands:
        train_argcands_feats.append(self.extract_features(argcand))
        if argcand["info"]["label"] == "NULL":
            train_argcands_target.append("NULL")
        else:
            train_argcands_target.append("ARG")

    # Transform the features to the format required by the classifier
    self.feat_vectorizer = DictVectorizer()
    train_argcands_feats = self.feat_vectorizer.fit_transform(train_argcands_feats)

    # Transform the target labels to the format required by the classifier
    self.target_names = list(set(train_argcands_target))
    train_argcands_target = [self.target_names.index(target) for target in train_argcands_target]

    ## Train the appropriate supervised model      

    # Recursive Feature Elimination
    self.classifier = LogisticRegression()
    feat_selector = RFECV(estimator=self.classifier, step=1, cv=StratifiedKFold(train_argcands_target, 10))

    feat_selector.fit(train_argcands_feats,train_argcands_target)

    print feat_selector.n_features_
    print feat_selector.support_
    print feat_selector.ranking_
    print feat_selector.cv_scores_

    return

I know I should also perform GridSearch for the parameters of the LogisticRegression classifier, but I don't think that's the source of the error (or is it?).

I should mention that I'm testing with around 50 features, and almost all of them are categoric (that's why I use the DictVectorizer to transform them appropriately).

Any help or guidance you could give me is more than welcome. Thanks!

EDIT

Here's some training data examples:

train_argcands_feats = [{'head_lemma': u'Bras\xedlia', 'head': u'Bras\xedlia', 'head_postag': u'PROP'}, {'head_lemma': u'Pesquisa_Datafolha', 'head': u'Pesquisa_Datafolha', 'head_postag': u'N'}, {'head_lemma': u'dado', 'head': u'dado', 'head_postag': u'N'}, {'head_lemma': u'postura', 'head': u'postura', 'head_postag': u'N'}, {'head_lemma': u'maioria', 'head': u'maioria', 'head_postag': u'N'}, {'head_lemma': u'querer', 'head': u'quer', 'head_postag': u'V-FIN'}, {'head_lemma': u'PT', 'head': u'PT', 'head_postag': u'PROP'}, {'head_lemma': u'participar', 'head': u'participando', 'head_postag': u'V-GER'}, {'head_lemma': u'surpreendente', 'head': u'supreendente', 'head_postag': u'ADJ'}, {'head_lemma': u'Bras\xedlia', 'head': u'Bras\xedlia', 'head_postag': u'PROP'}, {'head_lemma': u'Pesquisa_Datafolha', 'head': u'Pesquisa_Datafolha', 'head_postag': u'N'}, {'head_lemma': u'revelar', 'head': u'revela', 'head_postag': u'V-FIN'}, {'head_lemma': u'recusar', 'head': u'recusando', 'head_postag': u'V-GER'}, {'head_lemma': u'maioria', 'head': u'maioria', 'head_postag': u'N'}, {'head_lemma': u'PT', 'head': u'PT', 'head_postag': u'PROP'}, {'head_lemma': u'participar', 'head': u'participando', 'head_postag': u'V-GER'}, {'head_lemma': u'surpreendente', 'head': u'supreendente', 'head_postag': u'ADJ'}, {'head_lemma': u'Bras\xedlia', 'head': u'Bras\xedlia', 'head_postag': u'PROP'}, {'head_lemma': u'Pesquisa_Datafolha', 'head': u'Pesquisa_Datafolha', 'head_postag': u'N'}, {'head_lemma': u'revelar', 'head': u'revela', 'head_postag': u'V-FIN'}, {'head_lemma': u'governo', 'head': u'Governo', 'head_postag': u'N'}, {'head_lemma': u'de', 'head': u'de', 'head_postag': u'PRP'}, {'head_lemma': u'governo', 'head': u'Governo', 'head_postag': u'N'}, {'head_lemma': u'recusar', 'head': u'recusando', 'head_postag': u'V-GER'}, {'head_lemma': u'maioria', 'head': u'maioria', 'head_postag': u'N'}, {'head_lemma': u'querer', 'head': u'quer', 'head_postag': u'V-FIN'}, {'head_lemma': u'PT', 'head': u'PT', 'head_postag': u'PROP'}, {'head_lemma': u'surpreendente', 'head': u'supreendente', 'head_postag': u'ADJ'}, {'head_lemma': u'Bras\xedlia', 'head': u'Bras\xedlia', 'head_postag': u'PROP'}, {'head_lemma': u'Pesquisa_Datafolha', 'head': u'Pesquisa_Datafolha', 'head_postag': u'N'}, {'head_lemma': u'revelar', 'head': u'revela', 'head_postag': u'V-FIN'}, {'head_lemma': u'muito', 'head': u'Muitas', 'head_postag': u'PRON-DET'}, {'head_lemma': u'prioridade', 'head': u'prioridades', 'head_postag': u'N'}, {'head_lemma': u'com', 'head': u'com', 'head_postag': u'PRP'}, {'head_lemma': u'prioridade', 'head': u'prioridades', 'head_postag': u'N'}]

train_argcands_target = ['NULL', 'ARG', 'ARG', 'ARG', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'ARG', 'ARG', 'ARG', 'ARG', 'NULL', 'NULL', 'NULL', 'NULL', 'ARG', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'ARG', 'NULL', 'NULL', 'NULL', 'NULL', 'ARG', 'ARG', 'NULL', 'NULL']
like image 856
feralvam Avatar asked Sep 16 '12 03:09

feralvam


People also ask

How do I fix TypeError only integer scalar arrays can be converted to a scalar index?

Error encountered is: "TypeError: only integer scalar arrays can be converted to a scalar index". This error comes when we handle Numpy array in the wrong way using Numpy utility functions. The fix is to refer the documentation of the Numpy function and use the function correctly.

What does only integer scalar arrays can be converted to a scalar index mean?

The error 'TypeError: only integer scalar arrays can be converted to a scalar index' occurs when you try to index an ordinary Python list with an array of values. If you want to index a list, you have to use single values.


2 Answers

I finally got to solve the problem. Two things had to be done:

  1. train_argcands_target is a list and it has to be a numpy array. I'm surprised it worked well before when I just used the estimator directly.
  2. For some reason (I don't know why, yet), it doesn't work either if I use the sparse matrix created by the DictVectorizer. I had to, "manually", transform each feature dictionary to a feature array with just integers representing each feature value. The transformation process is similar to the one I present in the code for the target values.

Thanks to everyone who tried to help!

like image 169
feralvam Avatar answered Sep 18 '22 14:09

feralvam


If anyone is still interested,

I used the CountVectorizer on something very similar and it gave me the same error. I realized that the vectorizer gives me a COO sparse matrix which is basically a coordinate list. Elements in COO matrices can't be accessed through row indexes. It is best to convert it to a CSR matrix (Compressed Sparse Row) which indexes row-wise. The conversion can be done easily coo_matrix.tocsr(). No other change is required, this worked for me.

like image 22
user926321 Avatar answered Sep 18 '22 14:09

user926321