I'm trying to understand the relationship between sklearn's .fit() method and the .predict() method; mainly, how exactly data is (typically) passed from one to the other. I haven't found another question on SO that addresses this directly, though some dance around it (e.g. here).
I've written a custom estimator using the BaseEstimator and RegressorMixin classes, but have run into a NotFittedError a handful of times as I've begun running my data through it. Could someone walk me through a simple linear regression and how the data is passed through the fit and predict methods? No need to get into the math - I understand how regressions work and what the pieces of the puzzle do. Maybe I'm overlooking the obvious and making it more complicated than it should be? But the estimator methods are feeling like a bit of a black box.
NotFittedError happens when you try to use the .predict() method of your estimator before you have trained it with the .fit() method.
Let's take LinearRegression from scikit-learn as an example.
>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
>>> # y = 1 * x_0 + 2 * x_1 + 3
>>> y = np.dot(X, np.array([1, 2])) + 3
>>> reg = LinearRegression().fit(X, y)
>>> reg.score(X, y)
1.0
>>> reg.coef_
array([1., 2.])
>>> reg.intercept_
3.0000...
>>> reg.predict(np.array([[3, 5]]))
array([16.])
So with the line reg = LinearRegression().fit(X, y) you instantiate the LinearRegression class and then fit it to your data X and y, where X holds the independent variables and y the dependent variable. Once the model is trained, the beta coefficients of the linear regression are saved in the class attribute coef_, which you can access as reg.coef_. That's how the class knows how to predict when you call the .predict() method: it accesses those stored coefficients, and then it's just simple algebra to produce a prediction.
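To make that "simple algebra" concrete, here is a small sketch (assuming scikit-learn and NumPy are installed) showing that predict() just applies the stored coef_ and intercept_ to the new input:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# same data as above: y = 1 * x_0 + 2 * x_1 + 3
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
y = np.dot(X, np.array([1, 2])) + 3
reg = LinearRegression().fit(X, y)

# predict() is just: x_new . coef_ + intercept_
x_new = np.array([3, 5])
manual = np.dot(x_new, reg.coef_) + reg.intercept_
auto = reg.predict(x_new.reshape(1, -1))[0]
print(manual, auto)  # both are (numerically) 16.0
```

The two values agree because predict() reads the very attributes (coef_, intercept_) that fit() wrote.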
So back to your error: if you haven't fitted the model to your training data, the class doesn't have the attributes it needs to make predictions. Hopefully that clears up some of the confusion about what's going on inside the class, at least with regard to how the fit() and predict() methods interact.
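For instance (a small sketch, assuming a recent scikit-learn version where predict() validates the fitted state), calling predict() on a model that was never fitted reproduces the error:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.exceptions import NotFittedError

reg = LinearRegression()  # instantiated, but fit() never called

caught = False
try:
    reg.predict(np.array([[3, 5]]))
except NotFittedError as e:
    caught = True
    print(type(e).__name__, "-", e)
print(caught)  # True
```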
Ultimately, as commented above, this goes back to the fundamentals of object-oriented programming, so if you want to learn further I would read about how Python handles classes; scikit-learn models follow the same behavior.
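The same pattern can be shown with a plain Python class that has nothing to do with sklearn (a hypothetical Greeter class, purely for illustration): one method stores state on the instance, another method reads it, and calling them in the wrong order fails.

```python
class Greeter:
    """A plain Python class mirroring the fit/predict pattern."""

    def learn_name(self, name):
        # store state on the instance, just like fit() stores coef_
        self.name_ = name
        return self

    def greet(self):
        # relies on state set by learn_name(); fails with AttributeError
        # otherwise - the plain-Python analogue of NotFittedError
        return "Hello, " + self.name_

g = Greeter()
try:
    g.greet()  # no state yet
except AttributeError:
    print("call learn_name() first")

print(g.learn_name("Ada").greet())  # Hello, Ada
```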
Let's look at a toy estimator that implements linear regression:
from sklearn.base import BaseEstimator
import numpy as np

class ToyEstimator(BaseEstimator):
    def __init__(self):
        pass

    def fit(self, X, y):
        # append a column of ones so the intercept is learned as the last weight
        X = np.hstack((X, np.ones((len(X), 1))))
        # ordinary least squares: W = (X^T X)^-1 X^T y
        self.W = np.dot(np.dot(np.linalg.inv(np.dot(X.T, X)), X.T), y)
        self.coef_ = self.W[:-1]
        self.intercept_ = self.W[-1]
        return self

    def transform(self, X):
        X = np.hstack((X, np.ones((len(X), 1))))
        return np.dot(X, self.W)

X = np.random.randn(10, 3)
y = X[:, 0] * 1.11 + X[:, 1] * 2.22 + X[:, 2] * 3.33 + 4.44

reg = ToyEstimator()
reg.fit(X, y)
y_ = reg.transform(X)
print(reg.coef_, reg.intercept_)
Output:
[1.11 2.22 3.33] 4.4399999999999995
So what did the above code do?
fit: we fit/train the weights using the training data. These weights are member variables of the class [this is something you learn in OOP]. The transform method makes a prediction on the data using the trained weights, which are stored as member variables. So before calling transform you need to call fit, because transform uses the weights that are calculated during fit.
In sklearn modules, if you call transform (or predict) before fit, you get a NotFittedError exception.
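To raise that same error from your own estimator, scikit-learn provides check_is_fitted in sklearn.utils.validation. Below is a hedged sketch of the toy estimator above (renamed CheckedToyEstimator here, with the weight attribute named W_ to follow sklearn's trailing-underscore convention for fitted attributes):

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

class CheckedToyEstimator(BaseEstimator):
    def fit(self, X, y):
        X = np.hstack((X, np.ones((len(X), 1))))  # bias column
        # ordinary least squares, as in the toy example above
        self.W_ = np.linalg.inv(X.T @ X) @ X.T @ y
        return self

    def transform(self, X):
        # raises NotFittedError when fit() has not been called
        check_is_fitted(self, "W_")
        X = np.hstack((X, np.ones((len(X), 1))))
        return X @ self.W_

X = np.random.randn(10, 3)
y = X @ np.array([1.11, 2.22, 3.33]) + 4.44

caught = False
try:
    CheckedToyEstimator().transform(X)  # transform before fit
except NotFittedError:
    caught = True
print(caught)  # True

fitted = CheckedToyEstimator().fit(X, y)
print(np.round(fitted.W_[:-1], 2))  # close to [1.11 2.22 3.33]
```

With check_is_fitted in place, the custom estimator fails the same way built-in sklearn estimators do, instead of with a bare AttributeError.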