Complex dataset split - StratifiedGroupShuffleSplit

Tags: python, machine-learning, dataset, scikit-learn

I have a dataset of ~2m observations which I need to split into training, validation and test sets in the ratio 60:20:20. A simplified excerpt of my dataset looks like this:

+---------+------------+-----------+-----------+
| note_id | subject_id | category  |   note    |
+---------+------------+-----------+-----------+
|       1 |          1 | ECG       | blah ...  |
|       2 |          1 | Discharge | blah ...  |
|       3 |          1 | Nursing   | blah ...  |
|       4 |          2 | Nursing   | blah ...  |
|       5 |          2 | Nursing   | blah ...  |
|       6 |          3 | ECG       | blah ...  |
+---------+------------+-----------+-----------+

There are multiple categories - which are not evenly balanced - so I need to ensure that the training, validation and test sets all have the same proportions of categories as in the original dataset. This part is fine, I can just use StratifiedShuffleSplit from the sklearn library.

However, I also need to ensure that the observations from each subject are not split across the training, validation and test datasets. All the observations from a given subject need to be in the same bucket to ensure my trained model has never seen the subject before when it comes to validation/testing. E.g. every observation of subject_id 1 should be in the training set.

I can't think of a way to ensure a stratified split by category, prevent contamination (for want of a better word) of subject_id across datasets, ensure a 60:20:20 split and ensure that the dataset is somehow shuffled. Any help would be appreciated!

Thanks!

EDIT:

I've now learnt that keeping groups together across dataset splits can also be accomplished in sklearn with the GroupShuffleSplit class. So essentially, what I need is a combined stratified and grouped shuffle split, i.e. a StratifiedGroupShuffleSplit, which does not exist. GitHub issue: https://github.com/scikit-learn/scikit-learn/issues/12076
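For reference, the grouped (but not stratified) half of the problem can be sketched with two passes of GroupShuffleSplit. This is only a minimal illustration: the toy DataFrame, its sizes and the seeds are made up, and the column names follow the excerpt above. Note that test_size here is a proportion of subjects, not of rows, so the row-level ratio is only approximately 60:20:20.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for the real data (hypothetical): 200 subjects, 1-5 notes each.
rng = np.random.default_rng(0)
subject_ids = np.repeat(np.arange(200), rng.integers(1, 6, size=200))
df = pd.DataFrame({
    "subject_id": subject_ids,
    "category": rng.choice(["ECG", "Discharge", "Nursing"],
                           size=len(subject_ids), p=[0.2, 0.1, 0.7]),
})

# Pass 1: hold out roughly 20% of the subjects as the test set.
outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
rest_idx, test_idx = next(outer.split(df, groups=df["subject_id"]))
df_rest, df_test = df.iloc[rest_idx], df.iloc[test_idx]

# Pass 2: 25% of the remaining subjects is roughly 20% of all subjects.
inner = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, val_idx = next(inner.split(df_rest, groups=df_rest["subject_id"]))
df_train, df_val = df_rest.iloc[train_idx], df_rest.iloc[val_idx]

for name, part in [("train", df_train), ("val", df_val), ("test", df_test)]:
    print(name, len(part), part["subject_id"].nunique())
```

Nothing in this sketch looks at category, so the category proportions in each set are left to chance, which is exactly the gap described above.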


asked Jul 03 '19 14:07

amin_nejad

2 Answers

Essentially, I need StratifiedGroupShuffleSplit, which does not exist (see the GitHub issue linked in the question). This is because the behaviour of such a function is unclear, and producing a dataset which is both grouped and stratified is not always possible, especially with a heavily imbalanced dataset such as mine. In my case, I want grouping to be strict, so that there is no overlap of groups whatsoever, while stratification and the 60:20:20 split ratio only need to hold approximately, i.e. as well as is possible.

As Ghanem mentions, I have no choice but to build a function to split the dataset myself, which I have done below:

import numpy as np
import pandas as pd

def StratifiedGroupShuffleSplit(df_main):

    df_main = df_main.sample(frac=1) # shuffle dataset

    # create empty train, val and test datasets
    df_train = pd.DataFrame()
    df_val = pd.DataFrame()
    df_test = pd.DataFrame()

    hparam_mse_wgt = 0.1 # must be between 0 and 1
    assert 0 <= hparam_mse_wgt <= 1
    train_proportion = 0.6 # must be between 0 and 1
    assert 0 <= train_proportion <= 1
    val_test_proportion = (1-train_proportion)/2

    # sort=False keeps the subjects in their shuffled order
    subject_grouped_df_main = df_main.groupby('subject_id', sort=False)
    # percentage share of each category in the full dataset
    category_grouped_df_main = df_main.groupby('category').count()[['subject_id']]/len(df_main)*100

    def calc_mse_loss(df):
        # mean squared difference (in percentage points) between the
        # category shares of df and those of the full dataset
        grouped_df = df.groupby('category').count()[['subject_id']]/len(df)*100
        df_temp = category_grouped_df_main.join(grouped_df, how='left', lsuffix='_main')
        df_temp = df_temp.fillna(0) # categories missing from df count as 0%
        df_temp['diff'] = (df_temp['subject_id_main'] - df_temp['subject_id'])**2
        mse_loss = np.mean(df_temp['diff'])
        return mse_loss

    i = 0
    for _, group in subject_grouped_df_main:

        # seed each dataset with one subject so the losses below are defined
        if i < 3:
            if i == 0:
                df_train = group.reset_index(drop=True)
            elif i == 1:
                df_val = group.reset_index(drop=True)
            else:
                df_test = group.reset_index(drop=True)
            i += 1
            continue

        # how much adding this subject would reduce each dataset's category mismatch
        # (DataFrame.append was removed in pandas 2.0, hence pd.concat)
        mse_loss_diff_train = calc_mse_loss(df_train) - calc_mse_loss(pd.concat([df_train, group], ignore_index=True))
        mse_loss_diff_val = calc_mse_loss(df_val) - calc_mse_loss(pd.concat([df_val, group], ignore_index=True))
        mse_loss_diff_test = calc_mse_loss(df_test) - calc_mse_loss(pd.concat([df_test, group], ignore_index=True))

        total_records = len(df_train) + len(df_val) + len(df_test)

        # positive when a dataset is still below its target share of the records
        len_diff_train = train_proportion - (len(df_train)/total_records)
        len_diff_val = val_test_proportion - (len(df_val)/total_records)
        len_diff_test = val_test_proportion - (len(df_test)/total_records)

        # signed square: keeps the direction of the shortfall
        len_loss_diff_train = len_diff_train * abs(len_diff_train)
        len_loss_diff_val = len_diff_val * abs(len_diff_val)
        len_loss_diff_test = len_diff_test * abs(len_diff_test)

        loss_train = (hparam_mse_wgt * mse_loss_diff_train) + ((1-hparam_mse_wgt) * len_loss_diff_train)
        loss_val = (hparam_mse_wgt * mse_loss_diff_val) + ((1-hparam_mse_wgt) * len_loss_diff_val)
        loss_test = (hparam_mse_wgt * mse_loss_diff_test) + ((1-hparam_mse_wgt) * len_loss_diff_test)

        # the subject goes to the dataset with the highest value
        if max(loss_train, loss_val, loss_test) == loss_train:
            df_train = pd.concat([df_train, group], ignore_index=True)
        elif max(loss_train, loss_val, loss_test) == loss_val:
            df_val = pd.concat([df_val, group], ignore_index=True)
        else:
            df_test = pd.concat([df_test, group], ignore_index=True)

        print(f"Group {i}. loss_train: {loss_train} | loss_val: {loss_val} | loss_test: {loss_test} | ")
        i += 1

    return df_train, df_val, df_test

df_train, df_val, df_test = StratifiedGroupShuffleSplit(df_main)

I have created an arbitrary loss function based on two things:

1. The average squared difference in the percentage representation of each category compared to the overall dataset (more precisely, how much this difference would shrink if the subject were added to a given dataset)
2. The squared difference, with its sign kept, between the proportional length of the dataset and what it should be according to the ratio supplied (60:20:20)

Weighting these two inputs to the loss function is done by the static hyperparameter hparam_mse_wgt. For my particular dataset, a value of 0.1 worked well but I would encourage you to play around with it if you use this function. Setting it to 0 will prioritise only maintaining the split ratio and ignore the stratification. Setting it to 1 would be vice versa.
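To make the two terms and their weighting concrete, here is a small standalone sketch that scores a single candidate subject against a single split. The helper names (category_pct, stratification_mse, size_term) and the toy numbers are assumptions for illustration; the formulas are meant to mirror the function above rather than replace it.

```python
import pandas as pd

def category_pct(categories):
    """Percentage share of each category in a list of labels."""
    return pd.Series(categories).value_counts(normalize=True) * 100

def stratification_mse(split_categories, overall_pct):
    """Mean squared gap (percentage points) between a split and the overall mix."""
    split_pct = category_pct(split_categories).reindex(overall_pct.index, fill_value=0)
    return float(((overall_pct - split_pct) ** 2).mean())

def size_term(split_len, total_len, target):
    """Signed square of the shortfall: positive while the split is under target."""
    diff = target - split_len / total_len
    return diff * abs(diff)

# Overall mix (toy): 60% Nursing, 30% ECG, 10% Discharge.
overall = category_pct(["Nursing"] * 6 + ["ECG"] * 3 + ["Discharge"])

val_now = ["Nursing", "Nursing"]   # current validation set (toy)
candidate = ["ECG"]                # notes of the subject being placed (toy)

# Term 1: how much adding the subject reduces the category mismatch.
gain = stratification_mse(val_now, overall) - stratification_mse(val_now + candidate, overall)
# Term 2: validation holds 2 of 20 assigned records against a 20% target.
size = size_term(split_len=2, total_len=20, target=0.2)

hparam_mse_wgt = 0.1
score = hparam_mse_wgt * gain + (1 - hparam_mse_wgt) * size
print(gain, size, score)
```

One thing this makes visible is that the first term is measured in squared percentage points while the second is a squared fraction, so the two are on very different scales. That may be part of why a small weight such as 0.1 is a sensible place to start tuning.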

Using this loss function, I then iterate through each subject (group) and append it to whichever dataset (training, validation or test) has the highest value, i.e. the one where adding the subject most improves the category balance and which is furthest below its target size.

It's not particularly complicated but it does the job for me. It won't necessarily work for every dataset, but the larger the dataset, the better the chance. Hopefully someone else will find it useful.
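Whichever splitting approach is used, it is worth checking the result. The sketch below is one possible check: the helper name check_split and the toy frames are hypothetical, and the column names follow the question. It reports any subject that appears in more than one dataset and tabulates the category shares per dataset.

```python
import pandas as pd

def check_split(splits, group_col="subject_id", label_col="category"):
    """Return (group ids found in more than one split, category shares per split)."""
    seen, overlap = set(), set()
    for part in splits.values():
        ids = set(part[group_col])
        overlap |= seen & ids   # ids already seen in an earlier split
        seen |= ids
    shares = pd.DataFrame({
        name: part[label_col].value_counts(normalize=True)
        for name, part in splits.items()
    }).fillna(0)                # a category absent from a split has share 0
    return overlap, shares

# Toy frames standing in for the output of the splitting function.
df_train = pd.DataFrame({"subject_id": [1, 1, 1, 2, 2],
                         "category": ["ECG", "Discharge", "Nursing", "Nursing", "Nursing"]})
df_val = pd.DataFrame({"subject_id": [3], "category": ["ECG"]})
df_test = pd.DataFrame({"subject_id": [4, 4], "category": ["Nursing", "ECG"]})

overlap, shares = check_split({"train": df_train, "val": df_val, "test": df_test})
print(overlap)   # should be empty if grouping was respected
print(shares)
```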


answered Sep 22 '22 00:09

amin_nejad

This question is more than a year old, but I found myself in a similar situation: I have labels and groups, and because of the nature of the groups, all data points of a group must be either in the test set only or in the train set only. I wrote a small algorithm using pandas and sklearn; I hope it helps.

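import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Assumes df has a 'label' column (the class to stratify on) and a
# 'groups' column (the ids that must not be split across sets).
# valid_size was not defined in the original snippet; 0.2 is an assumed value.
valid_size = 0.2

all_train = []
all_test = []
for label, label_df in df.groupby('label'):
    # if a group is already taken in test or train it must stay there
    label_df = label_df[~label_df['groups'].isin(all_train + all_test)]
    # if no unassigned groups are left for this label
    if label_df.shape[0] == 0:
        continue
    # GroupShuffleSplit cannot split a single group; as an added guard
    # (not in the original), a lone group is sent to train
    if label_df['groups'].nunique() < 2:
        all_train += label_df['groups'].tolist()
        continue
    train_inds, test_inds = next(GroupShuffleSplit(
        test_size=valid_size, n_splits=1, random_state=7).split(label_df, groups=label_df['groups']))

    all_train += label_df.iloc[train_inds]['groups'].tolist()
    all_test += label_df.iloc[test_inds]['groups'].tolist()

# select rows by group id, so every row of a group lands in the same set
train = df[df['groups'].isin(all_train)]
test = df[df['groups'].isin(all_test)]

form_train = set(train['groups'].tolist())
form_test = set(test['groups'].tolist())
inter = form_train.intersection(form_test)

# compare label counts overall and per set
print(df.groupby('label').count())
print(train.groupby('label').count())
print(test.groupby('label').count())
print(inter) # this should be empty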

answered Sep 23 '22 00:09

samai mohammed
