 

Stratified Labeled K-Fold Cross-Validation In Scikit-Learn

I'm trying to classify instances of a dataset as belonging to one of two classes, a or b. Class b is a minority class, making up only 8% of the dataset. Every instance is assigned an id indicating which subject generated the data. Because each subject generated multiple instances, ids are repeated frequently in the dataset.

The table below is just an example; the real table has about 100,000 instances. Each subject id has about 100 instances in the table. Every subject is tied to exactly one class, as you can see with "larry" below.

    * field  * field  *   id   *  class  
*******************************************
 0  *   _    *   _    *  bob   *    a
 1  *   _    *   _    *  susan *    a
 2  *   _    *   _    *  susan *    a
 3  *   _    *   _    *  bob   *    a
 4  *   _    *   _    *  larry *    b
 5  *   _    *   _    *  greg  *    a
 6  *   _    *   _    *  larry *    b
 7  *   _    *   _    *  bob   *    a
 8  *   _    *   _    *  susan *    a
 9  *   _    *   _    *  susan *    a
 10 *   _    *   _    *  bob   *    a
 11 *   _    *   _    *  greg  *    a
 ...   ...      ...      ...       ...

I would like to use cross-validation to tune the model, and I must stratify the dataset so that each fold contains a few examples of the minority class, b. The problem is that I have a second constraint: the same id must never appear in two different folds, as this would leak information about the subject.

I'm using Python's scikit-learn library. I need a method that combines LabelKFold, which ensures labels (ids) are not split across folds, and StratifiedKFold, which ensures every fold has a similar ratio of classes. How can I accomplish this with scikit-learn? If it is not possible to split on two constraints in sklearn, how can I effectively split the dataset by hand or with other Python libraries?

Chris F. asked Sep 03 '16

People also ask

How is k-fold cross-validation different from stratified k-fold cross-validation?

Stratified k-fold cross-validation is the same as plain k-fold cross-validation, except that it uses stratified sampling instead of random sampling, so each fold preserves the overall class proportions.
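A small illustration of the difference, on toy imbalanced labels (not the questioner's data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 8 instances of class 0, 2 of class 1.
y = np.array([0] * 8 + [1] * 2)
X = np.zeros((10, 1))  # features are irrelevant to the split itself

# StratifiedKFold preserves the 80/20 class ratio in every fold.
for _, test_idx in StratifiedKFold(n_splits=2).split(X, y):
    # each test fold gets 4 of class 0 and 1 of class 1
    print(np.bincount(y[test_idx]))
```

Plain KFold on the same data could easily place both class-1 instances in the same fold, leaving the other fold with no minority examples at all.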

What is k-fold cross-validation Sklearn?

The k-fold cross-validation procedure is a standard method for estimating the performance of a machine learning algorithm on a dataset.

What is repeated stratified k-fold cross-validation?

Repeated stratified k-fold cross-validation provides a way to improve the estimated performance of a machine learning model. It simply repeats the stratified cross-validation procedure multiple times, with a different shuffle each time, and reports the mean result across all folds from all runs.
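A minimal sketch using scikit-learn's RepeatedStratifiedKFold; the dataset and model here are synthetic stand-ins, not the questioner's:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic imbalanced data standing in for a real dataset.
X, y = make_classification(n_samples=200, weights=[0.92, 0.08], random_state=0)

# 5 stratified folds, repeated 3 times with different shuffles -> 15 scores.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(len(scores), scores.mean())  # 15 fold scores; report the mean across runs
```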


1 Answer

The following is a bit tricky with respect to indexing (it helps to use something like Pandas for it), but conceptually simple.

Suppose you make a dummy dataset containing only the id and class columns, and remove duplicate id entries so that each subject appears exactly once.

For your cross validation, run stratified cross validation on the dummy dataset. At each iteration:

  1. Find out which ids were selected for the train and test sets.

  2. Go back to the original dataset, and place every instance belonging to a train id in the train set and every instance belonging to a test id in the test set.

This works because:

  1. As you stated, each id is associated with a single class.

  2. Since we run stratified CV, each class is represented proportionally.

  3. Since each id appears in only the train set or the test set (but never both), no information about a subject leaks between folds.
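The recipe above can be sketched as follows. The column names and example data are hypothetical; note also that newer scikit-learn versions (1.0+) ship StratifiedGroupKFold, which handles both constraints directly.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical data: each subject id maps to exactly one class.
df = pd.DataFrame({
    "id":    ["bob", "susan", "susan", "bob", "larry", "greg", "larry", "moe", "moe"],
    "class": ["a",   "a",     "a",     "a",   "b",     "a",    "b",     "b",   "b"],
})

# Step 1: dummy dataset with one row per unique id (duplicates removed).
dummy = df.drop_duplicates("id").reset_index(drop=True)

# Step 2: run stratified CV on the dummy dataset, then map the chosen
# ids back to all of their rows in the original dataset.
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for train_i, test_i in skf.split(dummy["id"], dummy["class"]):
    train_ids = set(dummy["id"].iloc[train_i])
    test_ids = set(dummy["id"].iloc[test_i])
    train = df[df["id"].isin(train_ids)]
    test = df[df["id"].isin(test_ids)]
    # No id appears in both folds, and stratification on the dummy
    # dataset keeps class b represented in each fold.
    assert train_ids.isdisjoint(test_ids)
```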

Ami Tavory answered Sep 25 '22