 

sklearn.cross_validation.StratifiedShuffleSplit - error: "indices are out-of-bounds"


I was trying to split a sample dataset using scikit-learn's StratifiedShuffleSplit, following the example in the scikit-learn documentation:

import pandas as pd
import numpy as np

# UCI's wine dataset
wine = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")

# separate target variable from dataset
target = wine['quality']
data = wine.drop('quality', axis=1)

# Stratified Split of train and test data
from sklearn.cross_validation import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(target, n_iter=3, test_size=0.2)

for train_index, test_index in sss:
    xtrain, xtest = data[train_index], data[test_index]
    ytrain, ytest = target[train_index], target[test_index]

# Check target series for distribution of classes
ytrain.value_counts()
ytest.value_counts()

However, upon running this script, I get the following error:

IndexError: indices are out-of-bounds 

Could someone please point out what I am doing wrong here? Thanks!

Jason asked May 04 '15 06:05




1 Answer

You're running into the different conventions for Pandas DataFrame indexing versus NumPy ndarray indexing. The arrays train_index and test_index are collections of row indices. But data is a Pandas DataFrame object, and when you use a single index into that object, as in data[train_index], Pandas is expecting train_index to contain column labels rather than row indices. You can either convert the dataframe to a NumPy array, using .values:

data_array = data.values
for train_index, test_index in sss:
    xtrain, xtest = data_array[train_index], data_array[test_index]
    ytrain, ytest = target[train_index], target[test_index]

or use the Pandas .iloc accessor:

for train_index, test_index in sss:
    xtrain, xtest = data.iloc[train_index], data.iloc[test_index]
    ytrain, ytest = target[train_index], target[test_index]

I tend to favour the second approach, since it gives xtrain and xtest of type DataFrame rather than ndarray, and so keeps the column labels.
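To see the difference without downloading anything, here is a minimal sketch on a toy DataFrame. The values and the train_index array are made up for illustration (they are not from the wine dataset or from an actual split), and the exact exception type raised by data[train_index] may vary between Pandas versions, so the sketch catches both KeyError and IndexError.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the wine dataset.
data = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.39],
})
target = pd.Series([5, 5, 6, 6, 5], name="quality")

# Positional row indices, like those a splitter would yield (made up here).
train_index = np.array([0, 2, 3])

# data[train_index] treats the integers as column labels, so the lookup fails.
try:
    data[train_index]
    lookup_failed = False
except (KeyError, IndexError):
    lookup_failed = True

# Option 1: index the underlying NumPy array by row position.
xtrain_arr = data.values[train_index]

# Option 2: use .iloc, which selects rows by position and keeps column labels.
xtrain_df = data.iloc[train_index]
ytrain = target.iloc[train_index]

print(lookup_failed)
print(type(xtrain_arr).__name__, type(xtrain_df).__name__)
```

Using target.iloc here is a small defensive choice: target[train_index] works in the original code only because the Series has the default integer index, so labels and positions happen to coincide.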

Mark Dickinson answered Sep 21 '22 23:09