I'm trying to classify instances of a dataset as belonging to one of two classes, a or b. Class b is a minority class and makes up only 8% of the dataset. Every instance is assigned an id indicating which subject generated the data. Because every subject generated multiple instances, ids are repeated frequently in the dataset.
The table below is just an example; the real table has about 100,000 instances, and each subject id has about 100 instances in it. Every subject is tied to exactly one class, as you can see with "larry" below.
          field   field   id      class
     0      _       _     bob       a
     1      _       _     susan     a
     2      _       _     susan     a
     3      _       _     bob       a
     4      _       _     larry     b
     5      _       _     greg      a
     6      _       _     larry     b
     7      _       _     bob       a
     8      _       _     susan     a
     9      _       _     susan     a
    10      _       _     bob       a
    11      _       _     greg      a
    ..     ...     ...    ...      ...
I would like to use cross-validation to tune the model, and I must stratify the dataset so that each fold contains a few examples of the minority class, b. The problem is that I have a second constraint: the same id must never appear in two different folds, as that would leak information about the subject.
I'm using Python's scikit-learn library. I need a method that combines LabelKFold, which makes sure labels (ids) are not split among folds, with StratifiedKFold, which makes sure every fold has a similar ratio of classes. How can I accomplish the above using scikit-learn? If it is not possible to split on two constraints in sklearn, how can I effectively split the dataset by hand or with other Python libraries?
Stratified k-fold cross-validation is the same as plain k-fold cross-validation, except that it builds the folds by stratified sampling instead of random sampling, so each fold preserves the overall class proportions.
The k-fold cross-validation procedure is a standard method for estimating the performance of a machine learning algorithm on a dataset.
Repeated k-fold cross-validation provides a way to improve the estimated performance of a machine learning model: the cross-validation procedure is simply repeated multiple times, and the mean result across all folds from all runs is reported.
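For reference, both variants are available directly in scikit-learn. A minimal sketch (the toy data here is made up to mirror the question's 8% minority ratio):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, RepeatedStratifiedKFold

    # Toy data mirroring the question's class ratio: 92 a's, 8 b's.
    X = np.random.rand(100, 2)
    y = np.array(["a"] * 92 + ["b"] * 8)

    # Stratified sampling keeps the ~8% minority share in every fold.
    skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y):
        print("b's in test fold:", (y[test_idx] == "b").sum())

    # Repeated variant: the same split procedure rerun with fresh shuffles,
    # giving n_splits * n_repeats folds to average over.
    rskf = RepeatedStratifiedKFold(n_splits=4, n_repeats=3, random_state=0)

Note that neither of these, on its own, respects the id constraint; that is what the procedure below adds.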
The following is a bit tricky with respect to indexing (it helps to use something like Pandas for it), but conceptually simple.
Suppose you make a dummy dataset where the independent variables are only id and class. Furthermore, in this dataset, remove duplicate id entries.
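In Pandas that dummy dataset is a one-liner. A minimal sketch, with a hypothetical frame df mirroring the example table above:

    import pandas as pd

    # Hypothetical frame mirroring the example table.
    df = pd.DataFrame({
        "id":    ["bob", "susan", "susan", "bob", "larry", "greg",
                  "larry", "bob", "susan", "susan", "bob", "greg"],
        "class": ["a", "a", "a", "a", "b", "a",
                  "b", "a", "a", "a", "a", "a"],
    })

    # One row per subject; safe because every id maps to exactly one class.
    dummy = df[["id", "class"]].drop_duplicates(subset="id").reset_index(drop=True)
    print(dummy)  # bob/a, susan/a, larry/b, greg/a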
For your cross-validation, run stratified cross-validation on the dummy dataset. At each iteration:

- Find out which ids were selected for the train set and the test set.
- Go back to the original dataset, and insert all the instances belonging to those ids into the train and test sets accordingly.
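A minimal end-to-end sketch of that loop. StratifiedKFold needs at least n_splits subjects per class in the dummy frame, which the 12-row example above does not satisfy (it has a single b subject), so this sketch generates its own synthetic stand-in for the real table:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import StratifiedKFold

    # Synthetic stand-in for the real table: 100 subjects, 8 of class b,
    # each subject contributing 5 instances.
    ids = [f"subj{i}" for i in range(100)]
    classes = ["b" if i < 8 else "a" for i in range(100)]
    df = pd.DataFrame({"id": np.repeat(ids, 5), "class": np.repeat(classes, 5)})

    # Step 1: dummy dataset with one row per subject.
    dummy = df[["id", "class"]].drop_duplicates(subset="id").reset_index(drop=True)

    # Step 2: stratified CV on the dummy frame, then map each side's ids
    # back to every matching instance in the original data.
    skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(dummy, dummy["class"]):
        train_ids = dummy.loc[train_idx, "id"]
        test_ids = dummy.loc[test_idx, "id"]
        train_df = df[df["id"].isin(train_ids)]
        test_df = df[df["id"].isin(test_ids)]
        # train_df and test_df now respect both constraints.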
This works because:

- As you stated, each id is associated with a single class.
- Since we run stratified CV on the one-row-per-id dummy dataset, each class is represented proportionally by subject count; and because every subject contributes roughly the same number of instances (~100), the instance-level class ratio in each fold is approximately preserved as well.
- Since each id appears only in the train set or the test set (but not both), no information about a subject leaks between folds.
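Both claims are easy to verify empirically. A small hypothetical check_fold helper, called on each train/test pair produced by the loop above:

    def check_fold(train_df, test_df):
        # Constraint 1: no subject id appears on both sides of the split.
        shared = set(train_df["id"]) & set(test_df["id"])
        assert not shared, f"leaked ids: {shared}"
        # Constraint 2: the minority class is present in the test fold,
        # in roughly the overall proportion.
        print("class-b share in test fold:", (test_df["class"] == "b").mean())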