Some of the features in my model can take a while to generate, so to experiment quickly with multiple features and parameters I'd like to save the intermediate results to disk for later reuse.
As a concrete example (taken from here), suppose I have the following pipeline:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# EssayExtractor, LengthTransformer and MisspellingCountTransformer are
# custom transformers defined elsewhere in my code.
pipeline = Pipeline([
    ('extract_essays', EssayExtractor()),
    ('features', FeatureUnion([
        ('ngram_tf_idf', Pipeline([
            ('counts', CountVectorizer()),
            ('tf_idf', TfidfTransformer())
        ])),
        ('essay_length', LengthTransformer()),
        ('misspellings', MisspellingCountTransformer())
    ])),
    ('classifier', MultinomialNB())
])
Now suppose I want to change CountVectorizer() to CountVectorizer(max_features=1000). Ideally only CountVectorizer, TfidfTransformer and MultinomialNB would be recomputed, since for each of them either its own parameters or the output of a step before it has changed.
Can this be implemented somehow?
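One possible approach (a minimal sketch, assuming scikit-learn 0.19 or later): Pipeline accepts a memory argument that caches fitted transformers on disk via joblib, so steps whose parameters and inputs are unchanged are loaded from the cache instead of being refit. Here train_essays and train_labels are placeholder names for the training data:

from tempfile import mkdtemp
from sklearn.pipeline import Pipeline

cache_dir = mkdtemp()  # use a fixed path instead to keep the cache across runs

# Reuse the steps defined above, but let the pipeline cache its fitted transformers.
cached_pipeline = Pipeline(pipeline.steps, memory=cache_dir)
cached_pipeline.fit(train_essays, train_labels)  # placeholder training data

# Caveat: caching works per top-level step of this Pipeline, so changing
# CountVectorizer still refits the whole 'features' FeatureUnion, while
# 'extract_essays' is loaded from the cache; the inner 'ngram_tf_idf'
# Pipeline can be given its own memory as well.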
Pipeline requires naming the steps manually: the names are defined explicitly, without any rules. make_pipeline names the steps automatically, using a straightforward rule (the lowercase of each estimator's class name).
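A small sketch to make the contrast concrete, using the built-in estimators from the question's pipeline:

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Explicit names, chosen by hand:
pipe_explicit = Pipeline([
    ('counts', CountVectorizer()),
    ('tf_idf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])

# Auto-generated names: 'countvectorizer', 'tfidftransformer', 'multinomialnb'
pipe_auto = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
print(pipe_auto.named_steps.keys())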
The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting the parameters of the various steps using their names and the parameter name separated by a '__', as in the example below.
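For instance, with the step names from the pipeline in the question, the nested CountVectorizer parameter can be addressed like this (the grid values are only illustrative):

# Address the nested parameter via '__'-separated step names:
pipeline.set_params(features__ngram_tf_idf__counts__max_features=1000)

# The same naming convention drives hyperparameter search:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipeline, param_grid={
    'features__ngram_tf_idf__counts__max_features': [500, 1000, 2000],
})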
I've had some success doing this sort of thing with Pachyderm. It has a somewhat git-like CLI that lets you store your workflow. In the repo, take note of the "ML pipeline for Iris Classification" example, which gives some detail on how to create and save your pipeline and training data into what they call an "inference pipeline"; that allows the kinds of transformations you're attempting and lets you apply the trained pipeline to data.