I'm trying to classify instances of a dataset as belonging to one of two classes, a or b. Class b is a minority class and makes up only 8% of the dataset. Every instance is assigned an id indicating which subject generated the data. Because every subject generated multiple instances, ids are repeated frequently in the dataset.
The table below is just an example; the real table has about 100,000 instances, and each subject id has about 100 instances in it. Every subject is tied to exactly one class, as you can see with "larry" below.
          field   field   id      class
     0      _       _     bob       a
     1      _       _     susan     a
     2      _       _     susan     a
     3      _       _     bob       a
     4      _       _     larry     b
     5      _       _     greg      a
     6      _       _     larry     b
     7      _       _     bob       a
     8      _       _     susan     a
     9      _       _     susan     a
    10      _       _     bob       a
    11      _       _     greg      a
    ..     ...     ...    ...      ...
I would like to use cross-validation to tune the model, and I must stratify the dataset so that each fold contains a few examples of the minority class, b. The problem is that I have a second constraint: the same id must never appear in two different folds, as that would leak information about the subject.
I'm using Python's scikit-learn library. I need a method that combines LabelKFold, which makes sure labels (ids) are not split among folds, with StratifiedKFold, which makes sure every fold has a similar ratio of classes. How can I accomplish the above using scikit-learn? If it is not possible to split on two constraints in sklearn, how can I effectively split the dataset by hand or with other Python libraries?
Stratified k-fold cross-validation is the same as plain k-fold cross-validation, except that it builds the folds by stratified sampling instead of random sampling, so each fold preserves the overall class proportions.
The k-fold cross-validation procedure is a standard method for estimating the performance of a machine learning algorithm on a dataset.
Repeated k-fold cross-validation provides a way to improve the estimated performance of a machine learning model: the cross-validation procedure is simply repeated multiple times, and the mean result across all folds from all runs is reported.
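For reference, both variants are available directly in scikit-learn. A minimal sketch (the toy data here is made up to mirror the question's 8% minority ratio):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, RepeatedStratifiedKFold

    # Toy data mirroring the question's class ratio: 92 a's, 8 b's.
    X = np.random.rand(100, 2)
    y = np.array(["a"] * 92 + ["b"] * 8)

    # Stratified sampling keeps the ~8% minority share in every fold.
    skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y):
        print("b's in test fold:", (y[test_idx] == "b").sum())

    # Repeated variant: the same split procedure rerun with fresh shuffles,
    # giving n_splits * n_repeats folds to average over.
    rskf = RepeatedStratifiedKFold(n_splits=4, n_repeats=3, random_state=0)

Note that neither of these, on its own, respects the id constraint; that is what the procedure below adds.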
The following is a bit tricky with respect to indexing (it helps to use something like Pandas for it), but conceptually simple.
Suppose you make a dummy dataset where the independent variables are only id and class. Furthermore, in this dataset, remove duplicate id entries.
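In Pandas that dummy dataset is a one-liner. A minimal sketch, with a hypothetical frame df mirroring the example table above:

    import pandas as pd

    # Hypothetical frame mirroring the example table.
    df = pd.DataFrame({
        "id":    ["bob", "susan", "susan", "bob", "larry", "greg",
                  "larry", "bob", "susan", "susan", "bob", "greg"],
        "class": ["a", "a", "a", "a", "b", "a",
                  "b", "a", "a", "a", "a", "a"],
    })

    # One row per subject; safe because every id maps to exactly one class.
    dummy = df[["id", "class"]].drop_duplicates(subset="id").reset_index(drop=True)
    print(dummy)  # bob/a, susan/a, larry/b, greg/a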
For your cross-validation, run stratified cross-validation on the dummy dataset. At each iteration:

- Find out which ids were selected for the train set and the test set.
- Go back to the original dataset, and insert all the instances belonging to those ids into the train and test sets accordingly.
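A minimal end-to-end sketch of that loop. StratifiedKFold needs at least n_splits subjects per class in the dummy frame, which the 12-row example above does not satisfy (it has a single b subject), so this sketch generates its own synthetic stand-in for the real table:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import StratifiedKFold

    # Synthetic stand-in for the real table: 100 subjects, 8 of class b,
    # each subject contributing 5 instances.
    ids = [f"subj{i}" for i in range(100)]
    classes = ["b" if i < 8 else "a" for i in range(100)]
    df = pd.DataFrame({"id": np.repeat(ids, 5), "class": np.repeat(classes, 5)})

    # Step 1: dummy dataset with one row per subject.
    dummy = df[["id", "class"]].drop_duplicates(subset="id").reset_index(drop=True)

    # Step 2: stratified CV on the dummy frame, then map each side's ids
    # back to every matching instance in the original data.
    skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(dummy, dummy["class"]):
        train_ids = dummy.loc[train_idx, "id"]
        test_ids = dummy.loc[test_idx, "id"]
        train_df = df[df["id"].isin(train_ids)]
        test_df = df[df["id"].isin(test_ids)]
        # train_df and test_df now respect both constraints.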
This works because:

- As you stated, each id is associated with a single class.
- Since we run stratified CV on the one-row-per-id dummy dataset, each class is represented proportionally by subject count; and because every subject contributes roughly the same number of instances (~100), the instance-level class ratio in each fold is approximately preserved as well.
- Since each id appears only in the train set or the test set (but not both), no information about a subject leaks between folds.
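Both claims are easy to verify empirically. A small hypothetical check_fold helper, called on each train/test pair produced by the loop above:

    def check_fold(train_df, test_df):
        # Constraint 1: no subject id appears on both sides of the split.
        shared = set(train_df["id"]) & set(test_df["id"])
        assert not shared, f"leaked ids: {shared}"
        # Constraint 2: the minority class is present in the test fold,
        # in roughly the overall proportion.
        print("class-b share in test fold:", (test_df["class"] == "b").mean())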