
sample_weight parameter shape error in scikit-learn GridSearchCV

Passing the sample_weight parameter to GridSearchCV raises an error because of a shape mismatch. My suspicion is that cross-validation does not split sample_weight in step with the dataset.

First part: Using sample_weight as a model parameter works beautifully

Let's consider a simple example, first without GridSearch:

import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt


dataURL = 'https://raw.githubusercontent.com/mcasl/PAELLA/master/data/sinusoidal_data.csv'

# Read the file once and pick out the three columns used below
data = pd.read_csv(dataURL, usecols=["x", "y", "Occurrences"])
x = data["x"]
y = data["y"]
occurrences = data["Occurrences"]
my_sample_weights = (1 - occurrences/10000)**3

my_sample_weights contains the importance that I assign to each observation in x, y, as the following picture shows. The points of the sinusoidal curve get higher weights than those forming the background noise.

plt.scatter(x, y, c=my_sample_weights>0.9, cmap="cool")

Color coded dataset with respect to my_sample_weights
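
To make the weighting formula concrete, here is a small worked example of (1 - occurrences/10000)**3. The occurrence counts are made-up illustrative values, not rows from the CSV, and the 0 to 10000 range is an assumption suggested by the divisor. The helper name is hypothetical.

```python
# Hypothetical occurrence counts, chosen only to illustrate the formula.
example_occurrences = [0, 100, 1000, 5000, 10000]

def weight_from_occurrences(occurrences, scale=10000, power=3):
    """Same formula as my_sample_weights: (1 - occurrences/scale)**power."""
    return (1 - occurrences / scale) ** power

example_weights = [weight_from_occurrences(o) for o in example_occurrences]

for o, w in zip(example_occurrences, example_weights):
    # The scatter plot in the post colors points by whether weight > 0.9.
    print(f"occurrences={o:>5}  weight={w:.4f}  above_0.9={w > 0.9}")
```

Because of the cubic power, the weight drops quickly: under these assumptions only small counts keep a weight above the 0.9 threshold used for the coloring.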

Let's train a neural network, first without using the information contained in my_sample_weights:

def make_model(number_of_hidden_neurons=1):
    model = Sequential()
    model.add(Dense(number_of_hidden_neurons, input_shape=(1,), activation='tanh'))
    model.add(Dense(1, activation='linear'))
    model.compile(optimizer='sgd', loss='mse')
    return model

net_Not_using_sample_weight = make_model(number_of_hidden_neurons=6)
net_Not_using_sample_weight.fit(x,y, epochs=1000)

plt.scatter(x, y)
plt.scatter(x, net_Not_using_sample_weight.predict(x), c="green")

As the resulting plot shows, the neural network tries to follow the shape of the sinusoid, but the background noise prevents a good fit.

Now, using the information in my_sample_weights, the quality of the prediction is much better.
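
The post does not show the call that produced the weighted fit. Keras's fit does accept a sample_weight argument. As a conceptual illustration only, the sketch below uses NumPy rather than Keras to show how per-sample weights change a squared-error loss. The function name and toy data are hypothetical, and Keras may normalize its weighted loss differently.

```python
import numpy as np

def weighted_mse(y_true, y_pred, sample_weight):
    """Weighted mean of squared errors; larger weights count for more."""
    squared_errors = (np.asarray(y_true) - np.asarray(y_pred)) ** 2
    return float(np.average(squared_errors, weights=sample_weight))

# Toy data (hypothetical): three points on the curve and one noise point.
y_true = np.array([0.0, 1.0, 0.0, 5.0])   # the last value is the outlier
y_pred = np.array([0.0, 1.0, 0.0, 0.0])   # a model that follows the curve

uniform = np.ones(4)                            # every point counts equally
downweighted = np.array([1.0, 1.0, 1.0, 0.0])   # the noise point is ignored

loss_uniform = weighted_mse(y_true, y_pred, uniform)
loss_downweighted = weighted_mse(y_true, y_pred, downweighted)
print(loss_uniform, loss_downweighted)
```

With uniform weights the single noise point dominates the loss. With its weight set to zero the same predictions are scored as a perfect fit, which is why the optimizer stops chasing the background noise.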

Second part: Using sample_weight as a GridSearchCV parameter raises an error

my_Regressor = KerasRegressor(make_model)

validator = GridSearchCV(my_Regressor,
                         param_grid={'number_of_hidden_neurons': range(4, 5),
                                     'epochs': [500],
                                     },
                         fit_params={'sample_weight': [my_sample_weights]},
                         n_jobs=1,
                         )
validator.fit(x, y)

Trying to pass the sample_weights as a parameter gives the following error:

...
ValueError: Found a sample_weight array with shape (1000,) for an input with shape (666, 1). sample_weight cannot be broadcast.

It seems that the sample_weight vector has not been split in a similar manner to the input array.
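
What a fold-aware split of the weights would look like can be sketched with plain NumPy indexing. This is not scikit-learn's internal code. The arrays x_demo and w_demo are stand-ins, and the contiguous 3-fold layout is an assumption made for illustration.

```python
import numpy as np

n_samples, n_folds = 1000, 3
x_demo = np.linspace(0.0, 1.0, n_samples).reshape(-1, 1)   # stand-in inputs
w_demo = np.linspace(1.0, 2.0, n_samples)                  # stand-in weights

# Contiguous folds; each one serves once as the held-out part.
all_idx = np.arange(n_samples)
folds = np.array_split(all_idx, n_folds)

train_sizes = []
for k, test_idx in enumerate(folds):
    train_idx = np.setdiff1d(all_idx, test_idx)
    x_train = x_demo[train_idx]
    w_train = w_demo[train_idx]      # same indices, so rows stay aligned
    train_sizes.append(len(w_train))
    # The unsplit vector still has shape (1000,), as in the error message.
    print(k, x_train.shape, w_train.shape, w_demo.shape)
```

The point of indexing both arrays with the same train_idx is that each weight stays attached to the observation it was computed for, which a vector of the wrong length cannot guarantee.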

For what it's worth:

import sklearn
print(sklearn.__version__)  # 0.18.1

import keras
print(keras.__version__)  # 2.0.5
Manuel Castejón Limas asked Oct 30 '22 06:10


1 Answer

The problem is that GridSearchCV uses 3-fold cross-validation by default in this scikit-learn version, unless explicitly told otherwise. Two thirds of the data points are therefore used for training and one third for validation, which matches the error message: the 1000 weights passed through fit_params do not match the 666 training examples in the fold. Adjust the size and the code will run.

my_sample_weights = np.random.uniform(size=666)
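
The 666 in the error message follows from the fold arithmetic, which can be checked by hand. The helper below is hypothetical and assumes folds as equal in size as possible, with any remainder going to the first folds. Under that assumption a fixed-length vector of 666 values would match only one of the three training folds, and random values no longer carry the original weighting, so treat the line above as a shape check rather than a general fix.

```python
def kfold_train_sizes(n_samples, n_folds):
    """Training-set size per fold when the folds are as equal as possible."""
    base, extra = divmod(n_samples, n_folds)
    # The first `extra` folds hold one more test sample than the rest.
    test_sizes = [base + 1 if i < extra else base for i in range(n_folds)]
    return [n_samples - t for t in test_sizes]

sizes = kfold_train_sizes(1000, 3)
matches_666 = [s == 666 for s in sizes]
print(sizes)
print(matches_666)
```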
Sean answered Jan 02 '23 19:01
