
Saving parts of a sklearn pipeline

Some of the features in my model take a long time to generate, so to experiment quickly with different features and parameters I'd like to save them to disk for later use.

As a concrete example (taken from here), suppose I have the following pipeline:

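from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# EssayExractor, LengthTransformer and MispellingCountTransformer are custom
# transformers from the linked example; they are not part of scikit-learn.
pipeline = Pipeline([
  ('extract_essays', EssayExractor()),
  ('features', FeatureUnion([
    ('ngram_tf_idf', Pipeline([
      ('counts', CountVectorizer()),
      ('tf_idf', TfidfTransformer())
    ])),
    ('essay_length', LengthTransformer()),
    ('misspellings', MispellingCountTransformer())
  ])),
  ('classifier', MultinomialNB())
])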

If I change CountVectorizer() to CountVectorizer(max_features=1000), then only CountVectorizer and the steps downstream of it (TfidfTransformer and MultinomialNB) need to be recomputed, since either their own parameter or the transform before them has changed.

Can this be implemented somehow?

simonzack asked Aug 09 '15 15:08


People also ask

What's the difference between Pipeline() and make_pipeline() in the sklearn library?

Pipeline requires you to name each step explicitly, so a name can be anything you choose. make_pipeline names the steps automatically, using a simple rule: the lowercased class name of each estimator.
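A minimal sketch of the naming difference, using two of the scikit-learn transformers from the question. The variable names (`explicit`, `automatic`) are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline, make_pipeline

# Pipeline: every step name is chosen by hand.
explicit = Pipeline([
    ("counts", CountVectorizer()),
    ("tf_idf", TfidfTransformer()),
])

# make_pipeline: step names are derived from the lowercased class names.
automatic = make_pipeline(CountVectorizer(), TfidfTransformer())

explicit_names = [name for name, _ in explicit.steps]
automatic_names = [name for name, _ in automatic.steps]

print(explicit_names)   # ['counts', 'tf_idf']
print(automatic_names)  # ['countvectorizer', 'tfidftransformer']
```

The step names matter later on, because they are what you use to address a step's parameters.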

How does Pipeline work in sklearn?

The purpose of a pipeline is to assemble several steps that can be cross-validated together while setting different parameters. To make that possible, a parameter of any step can be set through the step's name and the parameter name joined by '__', for example counts__max_features for the max_features parameter of a step named counts.
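As a sketch of the '__' convention, the snippet below rebuilds a cut-down version of the question's pipeline (the custom transformers are left out, since they are not part of scikit-learn) and changes max_features on the nested CountVectorizer without rebuilding anything.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# Reduced version of the question's pipeline: scikit-learn steps only.
pipeline = Pipeline([
    ("features", FeatureUnion([
        ("ngram_tf_idf", Pipeline([
            ("counts", CountVectorizer()),
            ("tf_idf", TfidfTransformer()),
        ])),
    ])),
    ("classifier", MultinomialNB()),
])

# Each '__' descends one level: step -> union member -> inner step -> parameter.
pipeline.set_params(features__ngram_tf_idf__counts__max_features=1000)

max_features = pipeline.get_params()["features__ngram_tf_idf__counts__max_features"]
print(max_features)  # 1000
```

The same dotted-by-underscores names are what a parameter grid for GridSearchCV would use as keys.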


1 Answer

I've had some success doing this sort of thing with Pachyderm. It has a somewhat git-like CLI that lets you store your workflow. In the repo, take note of the "ML pipeline for Iris Classification" example, which gives some details on how to create and save your pipeline and training data into what they call an "inference pipeline". That lets you make the kinds of transformations you're attempting and then apply the saved pipeline and training data.
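If an external tool is more than you need, a smaller version of the same idea can be sketched with scikit-learn's own `memory` argument to Pipeline. This is separate from the Pachyderm workflow above. It caches fitted transformers on disk, so a refit reuses any transformer whose parameters and input data are unchanged. The toy documents and labels are made up for illustration. Note that `memory` does not cache the final estimator, and FeatureUnion has no `memory` argument of its own, so with the question's layout it would have to be set on each nested Pipeline.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for this sketch.
docs = ["the cat sat", "the dog barked", "a cat and a dog", "birds sing"]
labels = [0, 1, 0, 1]

cache_dir = mkdtemp()  # fitted transformers are stored here
pipeline = Pipeline(
    [
        ("counts", CountVectorizer()),
        ("tf_idf", TfidfTransformer()),
        ("classifier", MultinomialNB()),
    ],
    memory=cache_dir,
)

pipeline.fit(docs, labels)  # fits both transformers and caches them

# Only the classifier changes, so the cached transformers are reused.
pipeline.set_params(classifier__alpha=0.5)
pipeline.fit(docs, labels)

predictions = pipeline.predict(docs)
rmtree(cache_dir)  # remove the cache once it is no longer needed
```

Changing a parameter on `counts` instead would invalidate its cache entry and, because its output changes, the entry for `tf_idf` as well, which matches the recomputation behaviour the question asks for.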

processoriented answered Oct 05 '22 10:10