I'm getting the following error when performing recursive feature selection with cross-validation: <pre class="prettyprint"><code>Traceback (most recent call last): File "/Users/.../srl/main.py", line 32, in <module> argident_sys.train_classifier() File "/Users/.../srl/identification.py", line 194, in train_classifier feat_selector.fit(train_argcands_feats,train_argcands_target) File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/feature_selection/rfe.py", line 298, in fit ranking_ = rfe.fit(X[train], y[train]).ranking_ TypeError: only integer arrays with one element can be converted to an index </code></pre> The code that generates the error is the following: <pre class="prettyprint"><code>def train_classifier(self): # Get the argument candidates argcands = self.get_argcands(self.reader) # Extract the necessary features from the argument candidates train_argcands_feats = [] train_argcands_target = [] for argcand in argcands: train_argcands_feats.append(self.extract_features(argcand)) if argcand["info"]["label"] == "NULL": train_argcands_target.append("NULL") else: train_argcands_target.append("ARG") # Transform the features to the format required by the classifier self.feat_vectorizer = DictVectorizer() train_argcands_feats = self.feat_vectorizer.fit_transform(train_argcands_feats) # Transform the target labels to the format required by the classifier self.target_names = list(set(train_argcands_target)) train_argcands_target = [self.target_names.index(target) for target in train_argcands_target] ## Train the appropriate supervised model # Recursive Feature Elimination self.classifier = LogisticRegression() feat_selector = RFECV(estimator=self.classifier, step=1, cv=StratifiedKFold(train_argcands_target, 10)) feat_selector.fit(train_argcands_feats,train_argcands_target) print feat_selector.n_features_ print feat_selector.support_ print feat_selector.ranking_ print feat_selector.cv_scores_ return </code></pre> I know I should also perform GridSearch for the parameters of the LogisticRegression classifier, but I don't think that's the source of the error (or is it?). I should mention that I'm testing with around 50 features, and almost all of them are categoric (that's why I use the DictVectorizer to transform them appropriately). Any help or guidance you could give me is more than welcome. Thanks! EDIT Here's some training data examples: <pre class="prettyprint"><code>train_argcands_feats = [{'head_lemma': u'Bras\xedlia', 'head': u'Bras\xedlia', 'head_postag': u'PROP'}, {'head_lemma': u'Pesquisa_Datafolha', 'head': u'Pesquisa_Datafolha', 'head_postag': u'N'}, {'head_lemma': u'dado', 'head': u'dado', 'head_postag': u'N'}, {'head_lemma': u'postura', 'head': u'postura', 'head_postag': u'N'}, {'head_lemma': u'maioria', 'head': u'maioria', 'head_postag': u'N'}, {'head_lemma': u'querer', 'head': u'quer', 'head_postag': u'V-FIN'}, {'head_lemma': u'PT', 'head': u'PT', 'head_postag': u'PROP'}, {'head_lemma': u'participar', 'head': u'participando', 'head_postag': u'V-GER'}, {'head_lemma': u'surpreendente', 'head': u'supreendente', 'head_postag': u'ADJ'}, {'head_lemma': u'Bras\xedlia', 'head': u'Bras\xedlia', 'head_postag': u'PROP'}, {'head_lemma': u'Pesquisa_Datafolha', 'head': u'Pesquisa_Datafolha', 'head_postag': u'N'}, {'head_lemma': u'revelar', 'head': u'revela', 'head_postag': u'V-FIN'}, {'head_lemma': u'recusar', 'head': u'recusando', 'head_postag': u'V-GER'}, {'head_lemma': u'maioria', 'head': u'maioria', 'head_postag': u'N'}, {'head_lemma': u'PT', 'head': u'PT', 'head_postag': u'PROP'}, {'head_lemma': u'participar', 'head': u'participando', 'head_postag': u'V-GER'}, {'head_lemma': u'surpreendente', 'head': u'supreendente', 'head_postag': u'ADJ'}, {'head_lemma': u'Bras\xedlia', 'head': u'Bras\xedlia', 'head_postag': u'PROP'}, {'head_lemma': u'Pesquisa_Datafolha', 'head': u'Pesquisa_Datafolha', 'head_postag': u'N'}, {'head_lemma': u'revelar', 'head': u'revela', 'head_postag': u'V-FIN'}, {'head_lemma': u'governo', 'head': u'Governo', 'head_postag': u'N'}, {'head_lemma': u'de', 'head': u'de', 'head_postag': u'PRP'}, {'head_lemma': u'governo', 'head': u'Governo', 'head_postag': u'N'}, {'head_lemma': u'recusar', 'head': u'recusando', 'head_postag': u'V-GER'}, {'head_lemma': u'maioria', 'head': u'maioria', 'head_postag': u'N'}, {'head_lemma': u'querer', 'head': u'quer', 'head_postag': u'V-FIN'}, {'head_lemma': u'PT', 'head': u'PT', 'head_postag': u'PROP'}, {'head_lemma': u'surpreendente', 'head': u'supreendente', 'head_postag': u'ADJ'}, {'head_lemma': u'Bras\xedlia', 'head': u'Bras\xedlia', 'head_postag': u'PROP'}, {'head_lemma': u'Pesquisa_Datafolha', 'head': u'Pesquisa_Datafolha', 'head_postag': u'N'}, {'head_lemma': u'revelar', 'head': u'revela', 'head_postag': u'V-FIN'}, {'head_lemma': u'muito', 'head': u'Muitas', 'head_postag': u'PRON-DET'}, {'head_lemma': u'prioridade', 'head': u'prioridades', 'head_postag': u'N'}, {'head_lemma': u'com', 'head': u'com', 'head_postag': u'PRP'}, {'head_lemma': u'prioridade', 'head': u'prioridades', 'head_postag': u'N'}] train_argcands_target = ['NULL', 'ARG', 'ARG', 'ARG', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'ARG', 'ARG', 'ARG', 'ARG', 'NULL', 'NULL', 'NULL', 'NULL', 'ARG', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'ARG', 'NULL', 'NULL', 'NULL', 'NULL', 'ARG', 'ARG', 'NULL', 'NULL'] </code></pre>

I finally got to solve the problem. Two things had to be done: <ol> <li> train_argcands_target is a list and it has to be a numpy array. I'm surprised it worked well before when I just used the estimator directly.</li> <li>For some reason (I don't know why, yet), it doesn't work either if I use the sparse matrix created by the DictVectorizer. I had to, "manually", transform each feature dictionary to a feature array with just integers representing each feature value. The transformation process is similar to the one I present in the code for the target values.</li> </ol> Thanks to everyone who tried to help!

TypeError: only integer arrays with one element can be converted to an index

Tags:

python

scikit-learn

feature-selection

I'm getting the following error when performing recursive feature selection with cross-validation:

Click to copy

Traceback (most recent call last):
  File "/Users/.../srl/main.py", line 32, in <module>
    argident_sys.train_classifier()
  File "/Users/.../srl/identification.py", line 194, in train_classifier
    feat_selector.fit(train_argcands_feats,train_argcands_target)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/feature_selection/rfe.py", line 298, in fit
    ranking_ = rfe.fit(X[train], y[train]).ranking_
TypeError: only integer arrays with one element can be converted to an index

The code that generates the error is the following:

Click to copy

def train_classifier(self):

    # Get the argument candidates
    argcands = self.get_argcands(self.reader)

    # Extract the necessary features from the argument candidates
    train_argcands_feats = []
    train_argcands_target = []

    for argcand in argcands:
        train_argcands_feats.append(self.extract_features(argcand))
        if argcand["info"]["label"] == "NULL":
            train_argcands_target.append("NULL")
        else:
            train_argcands_target.append("ARG")

    # Transform the features to the format required by the classifier
    self.feat_vectorizer = DictVectorizer()
    train_argcands_feats = self.feat_vectorizer.fit_transform(train_argcands_feats)

    # Transform the target labels to the format required by the classifier
    self.target_names = list(set(train_argcands_target))
    train_argcands_target = [self.target_names.index(target) for target in train_argcands_target]

    ## Train the appropriate supervised model      

    # Recursive Feature Elimination
    self.classifier = LogisticRegression()
    feat_selector = RFECV(estimator=self.classifier, step=1, cv=StratifiedKFold(train_argcands_target, 10))

    feat_selector.fit(train_argcands_feats,train_argcands_target)

    print feat_selector.n_features_
    print feat_selector.support_
    print feat_selector.ranking_
    print feat_selector.cv_scores_

    return

I know I should also perform GridSearch for the parameters of the LogisticRegression classifier, but I don't think that's the source of the error (or is it?).

I should mention that I'm testing with around 50 features, and almost all of them are categoric (that's why I use the DictVectorizer to transform them appropriately).

Any help or guidance you could give me is more than welcome. Thanks!

EDIT

Here's some training data examples:

Click to copy

train_argcands_feats = [{'head_lemma': u'Bras\xedlia', 'head': u'Bras\xedlia', 'head_postag': u'PROP'}, {'head_lemma': u'Pesquisa_Datafolha', 'head': u'Pesquisa_Datafolha', 'head_postag': u'N'}, {'head_lemma': u'dado', 'head': u'dado', 'head_postag': u'N'}, {'head_lemma': u'postura', 'head': u'postura', 'head_postag': u'N'}, {'head_lemma': u'maioria', 'head': u'maioria', 'head_postag': u'N'}, {'head_lemma': u'querer', 'head': u'quer', 'head_postag': u'V-FIN'}, {'head_lemma': u'PT', 'head': u'PT', 'head_postag': u'PROP'}, {'head_lemma': u'participar', 'head': u'participando', 'head_postag': u'V-GER'}, {'head_lemma': u'surpreendente', 'head': u'supreendente', 'head_postag': u'ADJ'}, {'head_lemma': u'Bras\xedlia', 'head': u'Bras\xedlia', 'head_postag': u'PROP'}, {'head_lemma': u'Pesquisa_Datafolha', 'head': u'Pesquisa_Datafolha', 'head_postag': u'N'}, {'head_lemma': u'revelar', 'head': u'revela', 'head_postag': u'V-FIN'}, {'head_lemma': u'recusar', 'head': u'recusando', 'head_postag': u'V-GER'}, {'head_lemma': u'maioria', 'head': u'maioria', 'head_postag': u'N'}, {'head_lemma': u'PT', 'head': u'PT', 'head_postag': u'PROP'}, {'head_lemma': u'participar', 'head': u'participando', 'head_postag': u'V-GER'}, {'head_lemma': u'surpreendente', 'head': u'supreendente', 'head_postag': u'ADJ'}, {'head_lemma': u'Bras\xedlia', 'head': u'Bras\xedlia', 'head_postag': u'PROP'}, {'head_lemma': u'Pesquisa_Datafolha', 'head': u'Pesquisa_Datafolha', 'head_postag': u'N'}, {'head_lemma': u'revelar', 'head': u'revela', 'head_postag': u'V-FIN'}, {'head_lemma': u'governo', 'head': u'Governo', 'head_postag': u'N'}, {'head_lemma': u'de', 'head': u'de', 'head_postag': u'PRP'}, {'head_lemma': u'governo', 'head': u'Governo', 'head_postag': u'N'}, {'head_lemma': u'recusar', 'head': u'recusando', 'head_postag': u'V-GER'}, {'head_lemma': u'maioria', 'head': u'maioria', 'head_postag': u'N'}, {'head_lemma': u'querer', 'head': u'quer', 'head_postag': u'V-FIN'}, {'head_lemma': u'PT', 'head': u'PT', 'head_postag': u'PROP'}, {'head_lemma': u'surpreendente', 'head': u'supreendente', 'head_postag': u'ADJ'}, {'head_lemma': u'Bras\xedlia', 'head': u'Bras\xedlia', 'head_postag': u'PROP'}, {'head_lemma': u'Pesquisa_Datafolha', 'head': u'Pesquisa_Datafolha', 'head_postag': u'N'}, {'head_lemma': u'revelar', 'head': u'revela', 'head_postag': u'V-FIN'}, {'head_lemma': u'muito', 'head': u'Muitas', 'head_postag': u'PRON-DET'}, {'head_lemma': u'prioridade', 'head': u'prioridades', 'head_postag': u'N'}, {'head_lemma': u'com', 'head': u'com', 'head_postag': u'PRP'}, {'head_lemma': u'prioridade', 'head': u'prioridades', 'head_postag': u'N'}]

train_argcands_target = ['NULL', 'ARG', 'ARG', 'ARG', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'ARG', 'ARG', 'ARG', 'ARG', 'NULL', 'NULL', 'NULL', 'NULL', 'ARG', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'ARG', 'NULL', 'NULL', 'NULL', 'NULL', 'ARG', 'ARG', 'NULL', 'NULL']

856

asked Sep 16 '12 03:09

feralvam

2 Answers

I finally got to solve the problem. Two things had to be done:

train_argcands_target is a list and it has to be a numpy array. I'm surprised it worked well before when I just used the estimator directly.
For some reason (I don't know why, yet), it doesn't work either if I use the sparse matrix created by the DictVectorizer. I had to, "manually", transform each feature dictionary to a feature array with just integers representing each feature value. The transformation process is similar to the one I present in the code for the target values.

Thanks to everyone who tried to help!

169

answered Sep 18 '22 14:09

feralvam

If anyone is still interested,

I used the CountVectorizer on something very similar and it gave me the same error. I realized that the vectorizer gives me a COO sparse matrix which is basically a coordinate list. Elements in COO matrices can't be accessed through row indexes. It is best to convert it to a CSR matrix (Compressed Sparse Row) which indexes row-wise. The conversion can be done easily coo_matrix.tocsr(). No other change is required, this worked for me.

answered Sep 18 '22 14:09

user926321

Related questions
                            
                                How to multiply numpy 2D array with numpy 1D array?
                            
                                Numpy concatenate 2D arrays with 1D array
                            
                                Using GridSearchCV with AdaBoost and DecisionTreeClassifier
                            
                                What is the difference between ! and % in Jupyter notebooks?
                            
                                Is there a way to convert CSV columns into hierarchical relationships?
                            
                                Get dimensions of a video file
                            
                                Why do ints require three times as much memory in Python?
                            
                                How to update code from Git to a Docker container
                            
                                why would a django test fail only when the full test suite is run?
                            
                                Call Python code from an existing project written in Swift
                            
                                Change the number of request retries in boto3
                            
                                pandas add column to groupby dataframe
                            
                                Writing Dask partitions into single file
                            
                                Loop over results from Path.glob() (Pathlib) [duplicate]
                            
                                How to disable printing reports after each epoch in Keras?
                            
                                How to hide hover tooltips on Spyder 4
                            
                                numpy array with dtype Decimal?
                            
                                Alternative to python string item assignment
                            
                                Controlling scheduling priority of python threads?
                            
                                convert binary string to numpy array

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

TypeError: only integer arrays with one element can be converted to an index

Tags:

python

scikit-learn

feature-selection

feralvam

People also ask

2 Answers

feralvam

user926321

Recent Activity

Donate For Us