I've been learning and practicing the sklearn library on my own. When I participated in Kaggle competitions, I noticed that the provided sample code used `BaseEstimator` from `sklearn.base`. I don't quite understand how/why `BaseEstimator` is used.
```python
from sklearn.base import BaseEstimator

class FeatureMapper:
    def __init__(self, features):
        # features contains (feature_name, column_name, extractor) tuples,
        # where extractor is a CountVectorizer
        self.features = features

    def fit(self, X, y=None):
        for feature_name, column_name, extractor in self.features:
            extractor.fit(X[column_name], y)
```

My question is: is `X` the same thing as `features`? If so, where is it assigned? If not, how can `X` be indexed with `X[column_name]`? ...
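To make the question concrete: `X` is not `features` — it is simply whatever the caller passes to `fit()`. If `X` is a pandas DataFrame, `X[column_name]` selects one column. The sketch below is a hypothetical illustration (the column names and data are made up, not from the linked repo):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# A toy DataFrame standing in for the training data passed as X
X = pd.DataFrame({
    "Title":       ["junior engineer", "senior engineer"],
    "Description": ["writes code",     "reviews code"],
})

# Same shape as FeatureMapper's features: (feature_name, column_name, extractor)
features = [
    ("TitleBag", "Title",       CountVectorizer()),
    ("DescBag",  "Description", CountVectorizer()),
]

for feature_name, column_name, extractor in features:
    # X[column_name] is one column of text; the extractor learns from it
    extractor.fit(X[column_name])
```

So inside `fit(self, X, y=None)`, nothing assigns `X` — the caller does, by writing something like `mapper.fit(train_df)`.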
This is what I usually see on sklearn's tutorial page:
```python
from sklearn import SomeClassifier

X = [[0, 0], [1, 1], [2, 2], [3, 3]]
Y = [0, 1, 2, 3]
clf = SomeClassifier()
clf = clf.fit(X, Y)
```
I couldn't find a good example or any documentation on sklearn's official page. I found the `sklearn.base` code on GitHub, but I'd like some examples and an explanation of how it is used.
UPDATE
Here is the link for the sample code: https://github.com/benhamner/JobSalaryPrediction/blob/master/features.py

Correction: I just realized `BaseEstimator` is used for the class `SimpleTransform`. I guess my first question is: why is it needed? (It's not used anywhere in the computation.) The other question is: when defining `fit`, what is `X`, and how is it assigned? Because usually I see:
```python
def mymethod(self, X, y=None):
    X = self.features
    # then do something to X[column_name]
```
From the scikit-learn documentation: `BaseEstimator` is the base class for all estimators in scikit-learn. Notes: all estimators should specify all the parameters that can be set at the class level in their `__init__` as explicit keyword arguments (no `*args` or `**kwargs`).
The fit() method takes the training data as arguments, which can be one array in the case of unsupervised learning, or two arrays in the case of supervised learning. Note that the model is fitted using X and y , but the object holds no reference to X and y .
Fitting data: the main API implemented by scikit-learn is that of the estimator . An estimator is any object that learns from data; it may be a classification, regression or clustering algorithm or a transformer that extracts/filters useful features from raw data.
The `fit` method takes two parameters: `X`, your data samples, where each row is one datapoint (an N-dimensional feature vector), and `y`, the labels, one per datapoint.
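A minimal runnable version of the tutorial snippet above, with `SomeClassifier` replaced by a real estimator (`KNeighborsClassifier`, chosen here just for illustration):

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [1, 1], [2, 2], [3, 3]]  # four 2-D samples, one per row
y = [0, 1, 2, 3]                      # one label per sample

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X, y)                         # the caller supplies X and y here

print(clf.predict([[0.1, 0.1]]))      # nearest sample is [0, 0], so label 0
```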
`BaseEstimator` provides, among other things, default implementations for the `get_params` and `set_params` methods (see the source code). This is useful to make the model grid-searchable with `GridSearchCV` for automated parameter tuning, and to make it behave well with others when combined in a `Pipeline`.
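A short sketch of what inheriting from `BaseEstimator` buys you. The class name and its `factor` parameter are invented for illustration; the point is that `get_params`/`set_params` come for free, provided `__init__` uses explicit keyword arguments:

```python
from sklearn.base import BaseEstimator, TransformerMixin

class ScaleBy(BaseEstimator, TransformerMixin):
    def __init__(self, factor=1.0):  # explicit kwargs, no *args/**kwargs
        self.factor = factor

    def fit(self, X, y=None):
        return self                  # nothing to learn for this toy example

    def transform(self, X):
        return [[v * self.factor for v in row] for row in X]

t = ScaleBy(factor=2.0)
print(t.get_params())        # inherited from BaseEstimator: {'factor': 2.0}
t.set_params(factor=3.0)     # this is what GridSearchCV calls internally
print(t.transform([[1, 2]])) # [[3.0, 6.0]]
```

Because `GridSearchCV` and `Pipeline` rely on exactly these two methods to clone estimators and set candidate parameters, a transformer like this can be dropped into either without extra work.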