Cross validation for MNIST dataset with pytorch and sklearn

I am new to PyTorch and am trying to implement a feed-forward neural network to classify the MNIST data set. I am having some problems using cross-validation. My data has the following shapes: x_train: torch.Size([45000, 784]) and y_train: torch.Size([45000])

I tried to use KFold from sklearn.

from sklearn.model_selection import KFold
kfold = KFold(n_splits=10)

Here is the first part of my train method where I'm dividing the data into folds:

for train_index, test_index in kfold.split(x_train, y_train):
    x_train_fold = x_train[train_index]
    x_test_fold = x_test[test_index]
    y_train_fold = y_train[train_index]
    y_test_fold = y_test[test_index]
    print(x_train_fold.shape)
    for epoch in range(epochs):
        ...

The indices for the y_train_fold variable are right; they are simply [ 0 1 2 ... 4497 4498 4499]. But the indices for x_train_fold are not: they are [ 4500 4501 4502 ... 44997 44998 44999]. The same goes for the test folds.

For the first iteration I want the variable x_train_fold to be the first 4500 pictures, in other words to have the shape torch.Size([4500, 784]), but it has the shape torch.Size([40500, 784]).

Any tips on how to get this right?

asked Nov 22 '19 14:11

Kimmen

1 Answer

I think you're confused!

Ignore the second dimension for a moment. When you have 45000 points and use 10-fold cross-validation, what is the size of each fold? 45000/10, i.e. 4500.

This means that each of your folds will contain 4500 data points. One of those folds will be used for testing and the remaining nine for training, i.e.

For testing: one fold => 4500 data points => size: 4500
For training: remaining folds => 45000-4500 data points => size: 45000-4500=40500

Thus, for the first iteration, the first 4500 data points (indices 0 to 4499) will be used for testing and the rest for training. (See the image below.)
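To make the arithmetic concrete, here is a minimal pure-Python sketch of how an unshuffled k-fold split carves up the indices. The helper name `contiguous_folds` is hypothetical and written only for illustration; it is not sklearn's code, it just mirrors the documented behaviour of KFold without shuffling (consecutive test folds, with any remainder spread over the first folds).

```python
def contiguous_folds(n_samples, n_splits):
    """Yield (train_indices, test_indices) for an unshuffled k-fold split.

    Hypothetical helper for illustration only; not sklearn's implementation.
    """
    # If n_samples is not divisible by n_splits, the first `extra` folds
    # get one additional sample each.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        # The current fold is the test set; everything else is training data.
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


folds = list(contiguous_folds(45000, 10))
first_train, first_test = folds[0]

print(len(first_train), len(first_test))             # 40500 4500
print(first_test[0], first_test[-1], first_train[0])  # 0 4499 4500
```

So in the first iteration the small 4500-sample block is the test fold, and the 40500 remaining samples form the training fold.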

Given that your data is x_train: torch.Size([45000, 784]) and y_train: torch.Size([45000]), this is how your code should look:

for train_index, test_index in kfold.split(x_train, y_train):
    print(train_index, test_index)

    # Both folds are slices of the same training data (x_train / y_train)
    x_train_fold = x_train[train_index]
    y_train_fold = y_train[train_index]
    x_test_fold = x_train[test_index]
    y_test_fold = y_train[test_index]

    print(x_train_fold.shape, y_train_fold.shape)
    print(x_test_fold.shape, y_test_fold.shape)
    break  # inspect only the first iteration

[ 4500  4501  4502 ... 44997 44998 44999] [   0    1    2 ... 4497 4498 4499]
torch.Size([40500, 784]) torch.Size([40500])
torch.Size([4500, 784]) torch.Size([4500])

So, when you say

I want the variable x_train_fold to be the first 4500 picture... shape torch.Size([4500, 784]).

you're wrong: that size corresponds to x_test_fold. In the first iteration, with 10 folds, x_train_fold will have 40500 points, so its size is supposed to be torch.Size([40500, 784]).
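If you want to check this over all ten iterations rather than only the first, something like the sketch below should do. It assumes zero-filled NumPy arrays as stand-ins for your MNIST tensors (the real x_train and y_train are not available here), and the names `fold_shapes` and `tested` are made up for this check.

```python
import numpy as np
from sklearn.model_selection import KFold

# Assumption: zero-filled stand-ins with the same shapes as the real data.
x_train = np.zeros((45000, 784), dtype=np.uint8)
y_train = np.zeros(45000, dtype=np.int64)

kfold = KFold(n_splits=10)

fold_shapes = []                               # (train shape, test shape) per fold
tested = np.zeros(len(x_train), dtype=int)     # how often each sample is a test sample

for train_index, test_index in kfold.split(x_train, y_train):
    # Train and test folds are both taken from x_train / y_train.
    x_train_fold, y_train_fold = x_train[train_index], y_train[train_index]
    x_test_fold, y_test_fold = x_train[test_index], y_train[test_index]

    fold_shapes.append((x_train_fold.shape, x_test_fold.shape))
    tested[test_index] += 1

print(fold_shapes[0])              # ((40500, 784), (4500, 784))
print(bool((tested == 1).all()))   # True: every sample is tested exactly once
```

Every fold has 40500 training and 4500 test samples, and each sample ends up in the test fold exactly once across the ten iterations. Your per-epoch training loop would go inside the `for` loop, after the fold variables are created.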

[Image: K-fold validation]

answered Nov 20 '22 05:11

kHarshit

Tags:

pytorch

scikit-learn

mnist

cross-validation

k-fold
