scikit-learn suggests the use of pickle for model persistence. However, the documentation notes the limitations of pickle across different versions of scikit-learn or Python. (See also this Stack Overflow question.)
In many machine learning approaches, only a few parameters are learned from large data sets. These estimated parameters are stored in attributes with a trailing underscore, e.g. coef_.
Now my question is the following: Can model persistence be achieved by persisting the estimated attributes and assigning to them later? Is this approach safe for all estimators in scikit-learn, or are there potential side-effects (e.g. private variables that have to be set) in the case of some estimators?
It seems to work for logistic regression, as seen in the following example:
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
try:
    from sklearn.model_selection import train_test_split
except ImportError:
    from sklearn.cross_validation import train_test_split
iris = datasets.load_iris()
tt_split = train_test_split(iris.data, iris.target, test_size=0.4)
X_train, X_test, y_train, y_test = tt_split
# Here we train the logistic regression
lr = LogisticRegression(class_weight='balanced')
lr.fit(X_train, y_train)
print(lr.score(X_test, y_test)) # prints 0.95
# Persisting
params = lr.get_params()
coef = lr.coef_
intercept = lr.intercept_
# classes_ is not documented as a public member,
# but it is not explicitly private (no leading underscore)
classes = lr.classes_
lr.n_iter_  # This is metadata; no need to persist it
# Now we try to load the Classifier
lr2 = LogisticRegression()
lr2.set_params(**params)
lr2.coef_ = coef
lr2.intercept_ = intercept
lr2.classes_ = classes
print(lr2.score(X_test, y_test)) #Prints the same: 0.95
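To make the idea above concrete, the sketch below round-trips the estimated attributes through JSON, avoiding pickle entirely. The JSON layout is my own for illustration, not a scikit-learn convention, and it assumes the attributes are plain ndarrays:

```python
import json

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
lr = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize hyper-parameters and the estimated attributes to JSON;
# ndarrays are converted to lists for JSON compatibility.
state = {
    "params": lr.get_params(),
    "coef_": lr.coef_.tolist(),
    "intercept_": lr.intercept_.tolist(),
    "classes_": lr.classes_.tolist(),
}
blob = json.dumps(state)

# Rebuild a fresh estimator from the JSON blob and reassign the attributes.
restored = json.loads(blob)
lr2 = LogisticRegression(**restored["params"])
lr2.coef_ = np.asarray(restored["coef_"])
lr2.intercept_ = np.asarray(restored["intercept_"])
lr2.classes_ = np.asarray(restored["classes_"])

assert (lr.predict(X) == lr2.predict(X)).all()
```

This works here because LogisticRegression's hyper-parameters are all JSON-serializable scalars and its learned state is purely the public arrays, which, as the answer below explains, is not guaranteed for every estimator.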
Setting the estimated attributes alone is not enough, at least not in general for all estimators.
I know of at least one example where this would fail: LinearDiscriminantAnalysis.transform() makes use of the private attribute _max_components:

def transform(self, X):
    # ... code omitted
    return X_new[:, :self._max_components]
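A minimal demonstration of this failure mode, assuming a reasonably recent scikit-learn (where the fitted-check passes as soon as any trailing-underscore attribute is set): copying only the public attributes leaves predict() working but breaks transform():

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

X = np.array([[1, 2, 3], [1, 2, 1], [4, 5, 6], [9, 9, 9]], dtype=float)
y = np.array([1, 2, 1, 2])

lda = LDA().fit(X, y)

# Copy only the public (trailing-underscore) attributes into a fresh instance.
lda2 = LDA(**lda.get_params())
for attr in ("coef_", "intercept_", "classes_", "means_", "xbar_",
             "scalings_", "explained_variance_ratio_", "priors_"):
    setattr(lda2, attr, getattr(lda, attr))

print(lda2.predict(X))   # works: predict only needs coef_, intercept_, classes_
try:
    lda2.transform(X)    # reads the private _max_components, which was not copied
except AttributeError as exc:
    print("transform failed:", exc)
```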
However, it might work for some estimators. If you only need this for a specific estimator, the best approach would be to look at that estimator's source code and save all attributes that are set in its __init__() and .fit() methods.

A more generic approach could be to save all items in the estimator's .__dict__. E.g.:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA().fit([[1, 2, 3], [1, 2, 1], [4, 5, 6], [9, 9, 9]], [1, 2, 1, 2])
lda.__dict__
# {'_max_components': 1,
# 'classes_': array([1, 2]),
# 'coef_': array([[ -9.55555556, 21.55555556, -9.55555556]]),
# 'explained_variance_ratio_': array([ 1.]),
# 'intercept_': array([-15.77777778]),
# 'means_': array([[ 2.5, 3.5, 4.5],
# [ 5. , 5.5, 5. ]]),
# 'n_components': None,
# 'priors': None,
# 'priors_': array([ 0.5, 0.5]),
# 'scalings_': array([[-2.51423299],
# [ 5.67164186],
# [-2.51423299]]),
# 'shrinkage': None,
# 'solver': 'svd',
# 'store_covariance': False,
# 'tol': 0.0001,
# 'xbar_': array([ 3.75, 4.5 , 4.75])}
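Restoring from such a snapshot is the reverse operation. The sketch below copies __dict__ wholesale into a blank instance, which is essentially what pickle's default __setstate__ does, so it inherits the same cross-version caveats:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

X = [[1, 2, 3], [1, 2, 1], [4, 5, 6], [9, 9, 9]]
y = [1, 2, 1, 2]
lda = LDA().fit(X, y)

# Snapshot the complete instance state, private attributes included.
state = dict(lda.__dict__)

# Rebuild a blank estimator and restore the snapshot in one step.
lda2 = LDA()
lda2.__dict__.update(state)

# Both public and private attributes are back, so transform() works too.
assert np.allclose(lda.transform(X), lda2.transform(X))
assert (lda.predict(X) == lda2.predict(X)).all()
```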
This won't be trivial for estimators that contain more complex data, such as ensembles that contain multiple estimators. See the blog post Scikit-learn Pipeline Persistence and JSON Serialization for more details.
Unfortunately, this will not safely carry estimators over to new versions of scikit-learn. Private attributes are essentially an implementation detail that could change anytime between releases.