I have the following code:

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.cross_validation import cross_val_score

# split the dataset for train and test
combnum['is_train'] = np.random.uniform(0, 1, len(combnum)) <= .75
train, test = combnum[combnum['is_train']==True], combnum[combnum['is_train']==False]

et = ExtraTreesClassifier(n_estimators=200, max_depth=None,
                          min_samples_split=10, random_state=0)

labels = train[list(label_columns)].values
tlabels = test[list(label_columns)].values
features = train[list(columns)].values
tfeatures = test[list(columns)].values

et_score = cross_val_score(et, features, labels, n_jobs=-1)
print("{0} -> ET: {1})".format(label_columns, et_score))
Checking the shape of the arrays:
features.shape
Out[19]: (43069, 34)
And
labels.shape
Out[20]: (43069, 1)
and I'm getting:
IndexError: too many indices for array
and this relevant part of the traceback:
---> 22 et_score = cross_val_score(et, features, labels, n_jobs=-1)
I'm creating the data from Pandas DataFrames. I searched here and saw some references to possible errors with this method, but I can't figure out how to correct it. This is what the data arrays look like: features
Out[21]: array([[ 0., 1., 1., ..., 0., 0., 1.], [ 0., 1., 1., ..., 0., 0., 1.], [ 1., 1., 1., ..., 0., 0., 1.], ..., [ 0., 0., 1., ..., 0., 0., 1.], [ 0., 0., 1., ..., 0., 0., 1.], [ 0., 0., 1., ..., 0., 0., 1.]])
labels
Out[22]: array([[1], [1], [1], ..., [1], [1], [1]])
When we do cross-validation in scikit-learn, the process requires an (R,)-shaped label array instead of (R, 1). Although they are the same thing to some extent, their indexing mechanisms are different. So in your case, just add:
c, r = labels.shape
labels = labels.reshape(c,)
before passing it to the cross-validation function.
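As an illustrative sketch (using synthetic stand-ins, since `combnum` and the real column names aren't shown), the reshape, or equivalently `numpy.ravel`, turns the (R, 1) column vector extracted from the DataFrame into the (R,) shape that `cross_val_score` expects. Note the import below uses `sklearn.model_selection`, which replaced `sklearn.cross_validation` in newer scikit-learn versions:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older versions

# synthetic stand-ins for the real features/labels arrays
features = np.random.rand(100, 34)
labels = np.random.randint(0, 2, size=(100, 1))  # (R, 1), as pulled from the DataFrame

print(labels.shape)      # (100, 1) -- triggers "too many indices" in older sklearn
labels = labels.ravel()  # same effect as labels.reshape(labels.shape[0],)
print(labels.shape)      # (100,)

et = ExtraTreesClassifier(n_estimators=10, random_state=0)
scores = cross_val_score(et, features, labels, n_jobs=-1)
print(scores)  # one accuracy score per fold
```

`ravel()` avoids unpacking the shape manually and reads the same either way; both produce a 1-D view of the labels.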