I was trying to split a sample dataset using Scikit-learn's StratifiedShuffleSplit. I followed the example shown in the Scikit-learn documentation here:
import pandas as pd
import numpy as np

# UCI's wine dataset
wine = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")

# separate target variable from dataset
target = wine['quality']
data = wine.drop('quality', axis=1)

# Stratified split of train and test data
from sklearn.cross_validation import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(target, n_iter=3, test_size=0.2)
for train_index, test_index in sss:
    xtrain, xtest = data[train_index], data[test_index]
    ytrain, ytest = target[train_index], target[test_index]

# Check target series for distribution of classes
ytrain.value_counts()
ytest.value_counts()
However, upon running this script, I get the following error:
IndexError: indices are out-of-bounds
Could someone please point out what I am doing wrong here? Thanks!
You're running into the different conventions for Pandas DataFrame indexing versus NumPy ndarray indexing. The arrays train_index and test_index are collections of row indices. But data is a Pandas DataFrame object, and when you use a single index into that object, as in data[train_index], Pandas is expecting train_index to contain column labels rather than row indices. You can either convert the DataFrame to a NumPy array, using .values:
data_array = data.values

for train_index, test_index in sss:
    xtrain, xtest = data_array[train_index], data_array[test_index]
    ytrain, ytest = target[train_index], target[test_index]
or use the Pandas .iloc accessor:
for train_index, test_index in sss:
    xtrain, xtest = data.iloc[train_index], data.iloc[test_index]
    ytrain, ytest = target[train_index], target[test_index]
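If it helps to see the two conventions in isolation, here is a minimal sketch on a toy DataFrame (the column names are made up and have nothing to do with the wine data):

import pandas as pd
import numpy as np

# toy frame: two labelled columns, three rows
df = pd.DataFrame({'alcohol': [9.4, 9.8, 10.0], 'pH': [3.51, 3.20, 3.26]})
rows = np.array([0, 2])

df.iloc[rows]   # selects rows 0 and 2 by position -- what we want
# df[rows]      # tries to select columns labelled 0 and 2; since the columns
#               # are 'alcohol' and 'pH', this raises an error (IndexError or
#               # KeyError, depending on your pandas version)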
I tend to favour the second approach, since it gives xtrain and xtest of type DataFrame rather than ndarray, and so keeps the column labels.
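As an aside, sklearn.cross_validation has since been deprecated and removed; in newer scikit-learn releases (0.18 and later) StratifiedShuffleSplit lives in sklearn.model_selection and works as a splitter object whose split method takes the data and the class labels. A rough equivalent of the code above under that API would be:

from sklearn.model_selection import StratifiedShuffleSplit

# n_splits replaces n_iter, and the class labels are passed to split()
# rather than to the constructor
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2)
for train_index, test_index in sss.split(data, target):
    xtrain, xtest = data.iloc[train_index], data.iloc[test_index]
    ytrain, ytest = target.iloc[train_index], target.iloc[test_index]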