Overfitting and data leakage in tensorflow/keras neural network

Tags:

Good morning, I'm new in machine learning and neural networks. I am trying to build a fully connected neural network to solve a regression problem. The dataset is composed by 18 features and 1 label, and all of these are physical quantities.

You can find the code below. I upload the figure of the loss function evolution along the epochs (you can find it below). I am not sure if there is overfitting. Someone can explain me why there is or not overfitting?

import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn import preprocessing

from sklearn.model_selection import train_test_split

from matplotlib import pyplot as plt

import keras
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping
from keras import optimizers
from sklearn.metrics import r2_score
from keras import regularizers
from keras import backend
from tensorflow.keras import regularizers
from keras.regularizers import l2

# =============================================================================
# Scelgo il test size
# =============================================================================
test_size = 0.2

dataset = pd.read_csv('DataSet.csv', decimal=',', delimiter = ";")

label = dataset.iloc[:,-1]
features = dataset.drop(columns = ['Label'])

y_max_pre_normalize = max(label)
y_min_pre_normalize = min(label)

def denormalize(y):
    final_value = y*(y_max_pre_normalize-y_min_pre_normalize)+y_min_pre_normalize
    return final_value

# =============================================================================
# Split
# =============================================================================

X_train1, X_test1, y_train1, y_test1 = train_test_split(features, label, test_size = test_size, shuffle = True)

y_test2 = y_test1.to_frame()
y_train2 = y_train1.to_frame()

# =============================================================================
# Normalizzo
# =============================================================================
scaler1 = preprocessing.MinMaxScaler()
scaler2 = preprocessing.MinMaxScaler()
X_train = scaler1.fit_transform(X_train1)
X_test = scaler2.fit_transform(X_test1)


scaler3 = preprocessing.MinMaxScaler()
scaler4 = preprocessing.MinMaxScaler()
y_train = scaler3.fit_transform(y_train2)
y_test = scaler4.fit_transform(y_test2)



# =============================================================================
# Creo la rete
# =============================================================================
optimizer = tf.keras.optimizers.Adam(lr=0.001)
model = Sequential()

model.add(Dense(60, input_shape = (X_train.shape[1],), activation = 'relu',kernel_initializer='glorot_uniform'))
model.add(Dropout(0.2))
model.add(Dense(60, activation = 'relu',kernel_initializer='glorot_uniform'))
model.add(Dropout(0.2))
model.add(Dense(60, activation = 'relu',kernel_initializer='glorot_uniform'))

model.add(Dense(1,activation = 'linear',kernel_initializer='glorot_uniform'))

model.compile(loss = 'mse', optimizer = optimizer, metrics = ['mse'])

history = model.fit(X_train, y_train, epochs = 100,
                    validation_split = 0.1, shuffle=True, batch_size=250
                    )

history_dict = history.history

loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

y_train_pred = denormalize(y_train_pred)
y_test_pred = denormalize(y_test_pred)


plt.figure()
plt.plot((y_test1),(y_test_pred),'.', color='darkviolet', alpha=1, marker='o', markersize = 2, markeredgecolor = 'black', markeredgewidth = 0.1)
plt.plot((np.array((-0.1,7))),(np.array((-0.1,7))),'-', color='magenta')
plt.xlabel('True')
plt.ylabel('Predicted')
plt.title('Test')

plt.figure()
plt.plot((y_train1),(y_train_pred),'.', color='darkviolet', alpha=1, marker='o', markersize = 2, markeredgecolor = 'black', markeredgewidth = 0.1)
plt.plot((np.array((-0.1,7))),(np.array((-0.1,7))),'-', color='magenta')
plt.xlabel('True')
plt.ylabel('Predicted')
plt.title('Train')

plt.figure()
plt.plot(loss_values,'b',label = 'training loss')
plt.plot(val_loss_values,'r',label = 'val training loss')
plt.xlabel('Epochs')
plt.ylabel('Loss Function')
plt.legend()

print("\n\nThe R2 score on the test set is:\t{:0.3f}".format(r2_score(y_test_pred, y_test1)))

print("The R2 score on the train set is:\t{:0.3f}".format(r2_score(y_train_pred, y_train1)))
from sklearn import metrics

# Measure MSE error.  
score = metrics.mean_squared_error(y_test_pred,y_test1)
print("\n\nFinal score test (MSE): %0.4f" %(score))
score1 = metrics.mean_squared_error(y_train_pred,y_train1)
print("Final score train (MSE): %0.4f" %(score1))
score2 = np.sqrt(metrics.mean_squared_error(y_test_pred,y_test1))
print(f"Final score test (RMSE): %0.4f" %(score2))
score3 = np.sqrt(metrics.mean_squared_error(y_train_pred,y_train1))
print(f"Final score train (RMSE): %0.4f" %(score3))

enter image description here

EDIT:

I tried alse to do feature importances and to raise n_epochs, these are the results:

Feature Importance:

enter image description here

No Feature Importace:

enter image description here

732

asked Jan 22 '20 09:01

Gabriele Valvo

1 Answers

Looks like you don't have overfitting! Your training and validation curves are descending together and converging. The clearest sign you could get of overfitting would be a deviation between these two curves, something like this: overfitting ecample

Since your two curves are descending and are not diverging, it indicates your NN training is healthy.

HOWEVER! Your validation curve is suspiciously below the training curve. This hints a possible data leakage (train and test data have been mixed somehow). More info on a nice an short blog post. In general, you should split the data before any other preprocessing (normalizing, augmentation, shuffling, etc...).

Other causes for this could be some type of regularization (dropout, BN, etc..) that is active while computing the training accuracy and it's deactivated when computing the Validation/Test accuracy.

120

answered Sep 23 '22 15:09

ibarrond

Related questions
                            
                                LSTM Keras input shape confusion
                            
                                multiplying two int arrays in python
                            
                                Fill dataframe nan values from a join
                            
                                Pandas How to create a new dataframe with a start and end even if on different rows
                            
                                What is the difference between json() method and json.loads()
                            
                                tensorflow transition to gpu version
                            
                                Forward fill missing values by group after condition is met in pandas
                            
                                python-docx: Parse a table to Panda Dataframe
                            
                                Get visual feedback from QValidator
                            
                                How to set a value for a specific threshold in SVC model and generate a confusion matrix?
                            
                                Installing Python 3.8 on windows 7 32bit with SP1
                            
                                Display pandas dataframe with larger font in jupyter notebook
                            
                                Aiohttp logging: how to distinguish log messages of different requests?
                            
                                pandas group by and find first non null value for all columns
                            
                                Bayesian network in Python: both construction and sampling
                            
                                Can the "off" color be set for a Matplotlib dashed line?
                            
                                Python kernel dies on Jupyter Notebook with tensorflow 2
                            
                                how to get a continuous rolling mean in pandas?
                            
                                pandas - Splitting date ranges on specific day boundary
                            
                                Airflow task running tweepy exits with return code -6

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Overfitting and data leakage in tensorflow/keras neural network

Tags:

python

neural-network

figure

anaconda

spyder

Gabriele Valvo

People also ask

1 Answers

ibarrond

Recent Activity

Donate For Us