Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Overfitting and data leakage in tensorflow/keras neural network

Good morning, I'm new in machine learning and neural networks. I am trying to build a fully connected neural network to solve a regression problem. The dataset is composed by 18 features and 1 label, and all of these are physical quantities.

You can find the code below. I upload the figure of the loss function evolution along the epochs (you can find it below). I am not sure if there is overfitting. Someone can explain me why there is or not overfitting?

import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn import preprocessing

from sklearn.model_selection import train_test_split

from matplotlib import pyplot as plt

import keras
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping
from keras import optimizers
from sklearn.metrics import r2_score
from keras import regularizers
from keras import backend
from tensorflow.keras import regularizers
from keras.regularizers import l2

# =============================================================================
# Scelgo il test size
# =============================================================================
test_size = 0.2

dataset = pd.read_csv('DataSet.csv', decimal=',', delimiter = ";")

label = dataset.iloc[:,-1]
features = dataset.drop(columns = ['Label'])

y_max_pre_normalize = max(label)
y_min_pre_normalize = min(label)

def denormalize(y):
    final_value = y*(y_max_pre_normalize-y_min_pre_normalize)+y_min_pre_normalize
    return final_value

# =============================================================================
# Split
# =============================================================================

X_train1, X_test1, y_train1, y_test1 = train_test_split(features, label, test_size = test_size, shuffle = True)

y_test2 = y_test1.to_frame()
y_train2 = y_train1.to_frame()

# =============================================================================
# Normalizzo
# =============================================================================
scaler1 = preprocessing.MinMaxScaler()
scaler2 = preprocessing.MinMaxScaler()
X_train = scaler1.fit_transform(X_train1)
X_test = scaler2.fit_transform(X_test1)


scaler3 = preprocessing.MinMaxScaler()
scaler4 = preprocessing.MinMaxScaler()
y_train = scaler3.fit_transform(y_train2)
y_test = scaler4.fit_transform(y_test2)



# =============================================================================
# Creo la rete
# =============================================================================
optimizer = tf.keras.optimizers.Adam(lr=0.001)
model = Sequential()

model.add(Dense(60, input_shape = (X_train.shape[1],), activation = 'relu',kernel_initializer='glorot_uniform'))
model.add(Dropout(0.2))
model.add(Dense(60, activation = 'relu',kernel_initializer='glorot_uniform'))
model.add(Dropout(0.2))
model.add(Dense(60, activation = 'relu',kernel_initializer='glorot_uniform'))

model.add(Dense(1,activation = 'linear',kernel_initializer='glorot_uniform'))

model.compile(loss = 'mse', optimizer = optimizer, metrics = ['mse'])

history = model.fit(X_train, y_train, epochs = 100,
                    validation_split = 0.1, shuffle=True, batch_size=250
                    )

history_dict = history.history

loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

y_train_pred = denormalize(y_train_pred)
y_test_pred = denormalize(y_test_pred)


plt.figure()
plt.plot((y_test1),(y_test_pred),'.', color='darkviolet', alpha=1, marker='o', markersize = 2, markeredgecolor = 'black', markeredgewidth = 0.1)
plt.plot((np.array((-0.1,7))),(np.array((-0.1,7))),'-', color='magenta')
plt.xlabel('True')
plt.ylabel('Predicted')
plt.title('Test')

plt.figure()
plt.plot((y_train1),(y_train_pred),'.', color='darkviolet', alpha=1, marker='o', markersize = 2, markeredgecolor = 'black', markeredgewidth = 0.1)
plt.plot((np.array((-0.1,7))),(np.array((-0.1,7))),'-', color='magenta')
plt.xlabel('True')
plt.ylabel('Predicted')
plt.title('Train')

plt.figure()
plt.plot(loss_values,'b',label = 'training loss')
plt.plot(val_loss_values,'r',label = 'val training loss')
plt.xlabel('Epochs')
plt.ylabel('Loss Function')
plt.legend()

print("\n\nThe R2 score on the test set is:\t{:0.3f}".format(r2_score(y_test_pred, y_test1)))

print("The R2 score on the train set is:\t{:0.3f}".format(r2_score(y_train_pred, y_train1)))
from sklearn import metrics

# Measure MSE error.  
score = metrics.mean_squared_error(y_test_pred,y_test1)
print("\n\nFinal score test (MSE): %0.4f" %(score))
score1 = metrics.mean_squared_error(y_train_pred,y_train1)
print("Final score train (MSE): %0.4f" %(score1))
score2 = np.sqrt(metrics.mean_squared_error(y_test_pred,y_test1))
print(f"Final score test (RMSE): %0.4f" %(score2))
score3 = np.sqrt(metrics.mean_squared_error(y_train_pred,y_train1))
print(f"Final score train (RMSE): %0.4f" %(score3))

enter image description here

EDIT:

I tried alse to do feature importances and to raise n_epochs, these are the results:

Feature Importance:

enter image description here

No Feature Importace:

enter image description here

like image 732
Gabriele Valvo Avatar asked Jan 22 '20 09:01

Gabriele Valvo


People also ask

How does Tensorflow deal with overfitting?

To prevent overfitting, the best solution is to use more complete training data. The dataset should cover the full range of inputs that the model is expected to handle. Additional data may only be useful if it covers new and interesting cases. A model trained on more complete data will naturally generalize better.

What is overfitting in keras?

Overfitting occurs when you achieve a good fit of your model on the training data, while it does not generalize well on new, unseen data. In other words, the model learned patterns specific to the training data, which are irrelevant in other data.

What is the main cause of overfitting in a neural network?

Deep neural networks are prone to overfitting because they learn millions or billions of parameters while building the model. A model having this many parameters can overfit the training data because it has sufficient capacity to do so.


1 Answers

Looks like you don't have overfitting! Your training and validation curves are descending together and converging. The clearest sign you could get of overfitting would be a deviation between these two curves, something like this: overfitting ecample

Since your two curves are descending and are not diverging, it indicates your NN training is healthy.

HOWEVER! Your validation curve is suspiciously below the training curve. This hints a possible data leakage (train and test data have been mixed somehow). More info on a nice an short blog post. In general, you should split the data before any other preprocessing (normalizing, augmentation, shuffling, etc...).

Other causes for this could be some type of regularization (dropout, BN, etc..) that is active while computing the training accuracy and it's deactivated when computing the Validation/Test accuracy.

like image 120
ibarrond Avatar answered Sep 23 '22 15:09

ibarrond