Some of the features in my model can take a while to generate, so to experiment quickly with multiple features and parameters I'd like to save the intermediate results to disk for later reuse.
As a concrete example (taken from here), suppose I have the following pipeline:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# EssayExtractor, LengthTransformer and MisspellingCountTransformer are
# custom transformers defined elsewhere in my code.
pipeline = Pipeline([
    ('extract_essays', EssayExtractor()),
    ('features', FeatureUnion([
        ('ngram_tf_idf', Pipeline([
            ('counts', CountVectorizer()),
            ('tf_idf', TfidfTransformer())
        ])),
        ('essay_length', LengthTransformer()),
        ('misspellings', MisspellingCountTransformer())
    ])),
    ('classifier', MultinomialNB())
])
Now suppose I want to change CountVectorizer() to CountVectorizer(max_features=1000). Ideally only CountVectorizer, TfidfTransformer and MultinomialNB would be recomputed, since for each of them either its own parameters or the output of a step before it has changed.
Can this be implemented somehow?
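One possible approach (a minimal sketch, assuming scikit-learn 0.19 or later): Pipeline accepts a memory argument that caches fitted transformers on disk via joblib, so steps whose parameters and inputs are unchanged are loaded from the cache instead of being refit. Here train_essays and train_labels are placeholder names for the training data:

from tempfile import mkdtemp
from sklearn.pipeline import Pipeline

cache_dir = mkdtemp()  # use a fixed path instead to keep the cache across runs

# Reuse the steps defined above, but let the pipeline cache its fitted transformers.
cached_pipeline = Pipeline(pipeline.steps, memory=cache_dir)
cached_pipeline.fit(train_essays, train_labels)  # placeholder training data

# Caveat: caching works per top-level step of this Pipeline, so changing
# CountVectorizer still refits the whole 'features' FeatureUnion, while
# 'extract_essays' is loaded from the cache; the inner 'ngram_tf_idf'
# Pipeline can be given its own memory as well.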
Pipeline requires naming the steps manually: the names are defined explicitly, without any rules. make_pipeline names the steps automatically, using a straightforward rule (the lowercase of each estimator's class name).
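A small sketch to make the contrast concrete, using the built-in estimators from the question's pipeline:

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Explicit names, chosen by hand:
pipe_explicit = Pipeline([
    ('counts', CountVectorizer()),
    ('tf_idf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])

# Auto-generated names: 'countvectorizer', 'tfidftransformer', 'multinomialnb'
pipe_auto = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
print(pipe_auto.named_steps.keys())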
The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting the parameters of the various steps using their names and the parameter name separated by a '__', as in the example below.
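For instance, with the step names from the pipeline in the question, the nested CountVectorizer parameter can be addressed like this (the grid values are only illustrative):

# Address the nested parameter via '__'-separated step names:
pipeline.set_params(features__ngram_tf_idf__counts__max_features=1000)

# The same naming convention drives hyperparameter search:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipeline, param_grid={
    'features__ngram_tf_idf__counts__max_features': [500, 1000, 2000],
})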
I've had some success doing this sort of thing with Pachyderm. It has a somewhat git-like CLI that lets you store your workflow. In the repo, take note of the "ML pipeline for Iris Classification" example, which gives some detail on how to create and save your pipeline and training data into what they call an "inference pipeline"; that allows the kinds of transformations you're attempting and lets you apply the trained pipeline to data.