It seems that k-fold cross-validation is rarely used with convolutional neural networks because training takes so long. I have a small dataset and would like to run k-fold cross-validation using the example given here. Is it possible? Thanks.
If you are using images with data generators, here's one way to do 10-fold cross-validation with Keras and scikit-learn. The strategy is to copy the files to training, validation, and test subfolders according to each fold.
import numpy as np
import os
import pandas as pd
import shutil
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# used to copy files according to each fold
def copy_images(df, directory):
    destination_directory = "{path to your data directory}/" + directory
    print("copying {} files to {}...".format(directory, destination_directory))
    # remove all files from previous fold
    if os.path.exists(destination_directory):
        shutil.rmtree(destination_directory)
    # create folder for files from this fold
    if not os.path.exists(destination_directory):
        os.makedirs(destination_directory)
    # create subfolders for each class
    for c in set(list(df['class'])):
        if not os.path.exists(destination_directory + '/' + c):
            os.makedirs(destination_directory + '/' + c)
    # copy files for this fold from a directory holding all the files
    for i, row in df.iterrows():
        try:
            # this is the path to all of your images kept together in a separate folder
            source_directory = "{path to all of your images}"
            path_from = os.path.join(source_directory, "{}.jpg".format(row['filename']))
            path_to = os.path.join(destination_directory, row['class'])
            # copy from the folder keeping all files to the training, test, or validation folder (the "directory" argument)
            shutil.copy(path_from, path_to)
        except Exception as e:
            print("Error when copying {}: {}".format(row['filename'], str(e)))
# dataframe containing the filenames of the images (e.g., GUID filenames) and the classes
df = pd.read_csv('{path to your data}.csv')
df_y = df['class']
# drop() returns a copy, so the original dataframe is left untouched
df_x = df.drop(columns=['class'])
skf = StratifiedKFold(n_splits = 10)
total_actual = []
total_predicted = []
total_val_accuracy = []
total_val_loss = []
total_test_accuracy = []
for i, (train_index, test_index) in enumerate(skf.split(df_x, df_y)):
    x_train, x_test = df_x.iloc[train_index], df_x.iloc[test_index]
    y_train, y_test = df_y.iloc[train_index], df_y.iloc[test_index]
    train = pd.concat([x_train, y_train], axis=1)
    test = pd.concat([x_test, y_test], axis=1)
    # take 20% of the training data from this fold for validation during training
    validation = train.sample(frac=0.2)
    # make sure the training data does not include the validation data
    train = train[~train['filename'].isin(list(validation['filename']))]
    # copy the images according to the fold
    copy_images(train, 'training')
    copy_images(validation, 'validation')
    copy_images(test, 'test')
    print('**** Running fold ' + str(i))
    # here you call a function to create and train your model, returning validation accuracy and validation loss
    val_accuracy, val_loss = create_train_model()
    # append validation accuracy and loss for average calculation later on
    total_val_accuracy.append(val_accuracy)
    total_val_loss.append(val_loss)
    # here you will call a predict() method that will predict the images on the "test" subfolder
    # this function returns the actual classes and the predicted classes (as lists) in the same order
    actual, predicted = predict()
    # append accuracy from the predictions on the test data
    total_test_accuracy.append(accuracy_score(actual, predicted))
    # append all of the actual and predicted classes for your final evaluation
    total_actual = total_actual + actual
    total_predicted = total_predicted + predicted
    # this is optional, but you can also see the cumulative performance over the folds completed so far
    print(classification_report(total_actual, total_predicted))
    print(confusion_matrix(total_actual, total_predicted))
print(classification_report(total_actual, total_predicted))
print(confusion_matrix(total_actual, total_predicted))
print("Validation accuracy on each fold:")
print(total_val_accuracy)
print("Mean validation accuracy: {}%".format(np.mean(total_val_accuracy) * 100))
print("Validation loss on each fold:")
print(total_val_loss)
print("Mean validation loss: {}".format(np.mean(total_val_loss)))
print("Test accuracy on each fold:")
print(total_test_accuracy)
print("Mean test accuracy: {}%".format(np.mean(total_test_accuracy) * 100))
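The mean alone hides how much the folds disagree, which matters on a small dataset. As a minimal sketch (the helper name summarize_folds and the example scores are hypothetical, not part of the code above), the per-fold lists can also be reported with their standard deviation:

```python
import statistics

def summarize_folds(name, scores):
    """Print and return the mean and sample standard deviation of per-fold scores."""
    mean = statistics.mean(scores)
    # stdev needs at least two values; fall back to 0.0 for a single fold
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    print("{}: {:.4f} +/- {:.4f} over {} folds".format(name, mean, spread, len(scores)))
    return mean, spread

# made-up scores for illustration only; pass total_test_accuracy in practice
example_scores = [0.80, 0.90, 0.85]
mean, spread = summarize_folds("Test accuracy", example_scores)
```

The same call works for total_val_accuracy and total_val_loss.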
In your predict() function, if you are using a data generator, the only way I could find to keep the predictions in the same order when testing was to use a batch_size of 1:
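# import path for Keras 2.x; adjust it to your Keras version if needed
from keras.preprocessing.image import ImageDataGenerator

generator = ImageDataGenerator().flow_from_directory(
    '{path to your data directory}/test',
    # Keras expects target_size as (height, width)
    target_size = (img_height, img_width),
    batch_size = 1,
    color_mode = 'rgb',
    # categorical for a multiclass problem
    class_mode = 'categorical',
    # this will also ensure the same order
    shuffle = False)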
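The rest of predict() has to turn the model output into two lists of class names in the same order. A minimal sketch of that step, assuming the model returns one row of class probabilities per image, that the true class indices come from generator.classes, and that the name-to-index mapping comes from generator.class_indices (the helper name decode_predictions and the toy values are hypothetical):

```python
def decode_predictions(probabilities, true_indices, class_indices):
    """Convert probability rows and true class indices into lists of class names."""
    # invert {'cat': 0, 'dog': 1} into {0: 'cat', 1: 'dog'}
    index_to_class = {index: name for name, index in class_indices.items()}
    predicted = []
    for row in probabilities:
        row = list(row)
        # the predicted class is the position of the largest probability
        predicted.append(index_to_class[row.index(max(row))])
    actual = [index_to_class[int(i)] for i in true_indices]
    return actual, predicted

# toy values standing in for generator.class_indices, generator.classes
# and the output of the model's prediction call
class_indices = {'cat': 0, 'dog': 1}
true_indices = [0, 0, 1]
probabilities = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]
actual, predicted = decode_predictions(probabilities, true_indices, class_indices)
```

Both return values are plain lists, so they can be concatenated onto total_actual and total_predicted in the fold loop.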
With this code, I was able to do 10-fold cross-validation using data generators (so I did not have to keep all files in memory). Copying the files can be a lot of work if you have millions of images, and batch_size = 1 could be a bottleneck if your test set is large, but for my project this worked well.
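If batch_size = 1 turns out to be too slow, one possible alternative is to keep shuffle = False and use a larger batch, provided the number of prediction steps covers every test image exactly once. This is an assumption and was not tested in the project described above. The helper name prediction_steps is hypothetical:

```python
import math

def prediction_steps(num_samples, batch_size):
    """Number of batches needed to see every sample exactly once."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # the last batch may be smaller than batch_size, hence the ceiling
    return math.ceil(num_samples / batch_size)

print(prediction_steps(103, 32))
print(prediction_steps(96, 32))
```

With a Keras directory iterator, num_samples would typically be generator.samples. Check that the number of predictions returned equals the number of test images before trusting the order.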