Examining the sklearn.base
, more specifically BaseEstimator
, and the different mixins, it is obvious that some of the mixins are dependent on the the ability to call .fit
or .predict
.
For example, if we'd look at the RegressorMixin
we'd see it relies on the .predict
method.
My question is why is there no implementation of an interface / abstract class that enforces the implementation of these methods?
I'd expect to have something like BaseRegressor
that has .predict()
as an abstract method and BaseClassifier
to have .predict_proba()
and .predict()
- or something similar
The fit() method takes the training data as arguments, which can be one array in the case of unsupervised learning, or two arrays in the case of supervised learning.
Elsewhere features are known as attributes, predictors, regressors, or independent variables. Nearly all estimators in scikit-learn assume that features are numeric, finite and not missing, even when they have semantically distinct domains and distributions (categorical, ordinal, count-valued, real-valued, interval).
Base classes Base class for all estimators in scikit-learn. Mixin class for all bicluster estimators in scikit-learn. Mixin class for all classifiers in scikit-learn. Mixin class for all cluster estimators in scikit-learn.
The common idiom in python is 'duck typing' - if it behaves like a duck it's a duck, if it implements fit
or any other relevant function it's a model for sklearn
there's also the concept of abstract base classes, but it's usage is less common
see more here: https://en.wikipedia.org/wiki/Duck_typing
There are a few things which together make it probably more clear why things are done in a package like scikit-learn
, the way they are:
duck typing vs inheritance: you can find very long arguments about which one is a better approach, and while they both have their advantages and disadvantages, at the end of the day, it comes down to what people in a community are used to. As somebody who does a lot of Python these days, I love duck typing, and I'm very comfortable with it. At the same time, 15 years ago, I loved abstract classes and OOP and what not, and I wouldn't understand why you would do things any other way. What I'm trying to say, is that people in Python like duck typing and that's partly why you see the pattern very often in some of its core packages.
duck typing, contrib packages and extentions: sometimes checking an input, we can either check its type, or duck type it for a certain functionality. If we check the type, that means any input to that method should actually inherit from those classes, whereas if you duck type them, they can simply be implementing those methods and they're fine. This is important because if a developer is writing an estimator outside scikit-learn
, for instance, which they want to be compatible with certain parts of scikit-learn
, they don't have to depend on scikit-learn
as a dependency (because that's how they can then inherit a certain class from the package), and simply implement those methods. If developers have the constraints to keep their package and their dependencies lightweight, this becomes relevant (and we have seen these exact issues in scikit-learn
).
Mixin
classes: the idea behind the Mixin
classes is not really that the child classes should inherit them and implement their methods; but it's more about adding a functionality to existing classes through them without having to copy/paste or reimplement any method. For instance, the TransformerMixin
adds the fit_transform
method to an object, assuming it already has fit
and transform
, without caring about weather the object is an estimator or a transformer. Again, you could argue that a certain design pattern from OOP may be better here, but that's a never ending argument, and this approach works, and the developers are comfortable with it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With