
scikit-learn error: The least populated class in y has only 1 member

I'm trying to split my dataset into a training and a test set by using the train_test_split function from scikit-learn, but I'm getting this error:

In [1]: y.iloc[:,0].value_counts()
Out[1]: 
M2    38
M1    35
M4    29
M5    15
M0    15
M3    15

In [2]: xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85, stratify=y)
Out[2]: 
Traceback (most recent call last):
  File "run_ok.py", line 48, in <module>
    xtrain,xtest,ytrain,ytest = train_test_split(X,y,test_size=1/3,random_state=85,stratify=y)
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1700, in train_test_split
    train, test = next(cv.split(X=arrays[0], y=stratify))
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 953, in split
    for train, test in self._iter_indices(X, y, groups):
  File "/home/aurora/.pyenv/versions/3.6.0/lib/python3.6/site-packages/sklearn/model_selection/_split.py", line 1259, in _iter_indices
    raise ValueError("The least populated class in y has only 1"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

However, all classes have at least 15 samples. Why am I getting this error?

X is a pandas DataFrame that holds the data points; y is a pandas DataFrame with a single column that contains the target variable.

I cannot post the original data because it's proprietary, but the problem should be fairly easy to reproduce: create a random pandas DataFrame (X) with 1k rows x 500 columns, and a pandas DataFrame (y) with the same number of rows (1k) that holds the target variable (a categorical label) for each row. The y DataFrame should contain several different labels (e.g. 'class1', 'class2'...), and each label should have at least 15 occurrences.

Aurora asked Apr 03 '17 08:04


2 Answers

The main point is that with stratified CV you get this warning when the requested number of splits cannot give every CV set the same class ratio as the full data. For example, with 5 splits and only 2 samples of one class, 2 CV sets get 1 sample of that class each and the other 3 CV sets get 0 samples, so the ratio for this class is not equal across the CV sets. The real problem, though, is only a CV set with 0 samples of a class: if every class has at least as many samples as there are splits (5 in this example), the warning won't appear.
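The counting argument can be checked with a few lines of plain Python. This is only a simplified model that spreads the samples of one class over the folds as evenly as possible; it is not scikit-learn's actual allocation code, and the helper name per_fold_counts is hypothetical.

```python
def per_fold_counts(n_samples_in_class, n_splits):
    """Spread the samples of one class over the folds as evenly as possible."""
    base, extra = divmod(n_samples_in_class, n_splits)
    # The first `extra` folds receive one additional sample.
    return [base + 1 if i < extra else base for i in range(n_splits)]


# 2 samples of a class, 5 folds: three folds never see the class.
print(per_fold_counts(2, 5))  # [1, 1, 0, 0, 0]

# 5 samples of a class, 5 folds: every fold gets one sample.
print(per_fold_counts(5, 5))  # [1, 1, 1, 1, 1]
```

Under this model, a fold with 0 samples of a class appears exactly when the class has fewer samples than there are folds.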

See https://stackoverflow.com/a/48314533/2340939.

user2340939 answered Sep 24 '22 21:09


The problem was that I was passing y as a one-column matrix (a DataFrame) rather than as a 1-D array of labels. If I pass only the first column of y, it works.

# y has a single column, so select it by position 0 to get a 1-D Series.
xtrain, xtest, ytrain, ytest = train_test_split(X, y.iloc[:, 0], test_size=1/3,
  random_state=85, stratify=y.iloc[:, 0])
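
For anyone who wants to try this without the original data, here is a minimal self-contained sketch. The data is a made-up stand-in (6 classes with 15 rows each and 4 random feature columns), and the column name "target" and the variable y_1d are assumptions chosen for illustration; only the train_test_split call mirrors the one above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 6 classes with 15 rows each.
labels = np.repeat(["M0", "M1", "M2", "M3", "M4", "M5"], 15)
X = pd.DataFrame(rng.normal(size=(len(labels), 4)))
y = pd.DataFrame({"target": labels})  # one-column DataFrame, as in the question

# Select the single column by position to get a 1-D Series of labels.
y_1d = y.iloc[:, 0]

xtrain, xtest, ytrain, ytest = train_test_split(
    X, y_1d, test_size=1/3, random_state=85, stratify=y_1d
)

# Each class should appear in the test set in proportion to its overall share.
print(ytest.value_counts())
```

Printing ytest.value_counts() is a quick way to confirm that the stratification kept the class proportions in the test set.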

Aurora answered Sep 24 '22 21:09