I'm trying to integrate a custom PhraseMatcher component into my nlp pipeline in a way that lets me load the custom spaCy model without having to re-add my custom components to a generic model on each load.
How can I load a spaCy model containing custom pipeline components?
I create the component, add it to my pipeline and save it with the following:
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Span, Token

class RESTCountriesComponent(object):
    name = 'countries'

    def __init__(self, nlp, label='GPE'):
        self.countries = [u'MyCountry', u'MyOtherCountry']
        self.label = nlp.vocab.strings[label]
        patterns = [nlp(c) for c in self.countries]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add('COUNTRIES', None, *patterns)

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []
        for _, start, end in matches:
            entity = Span(doc, start, end, label=self.label)
            spans.append(entity)
        doc.ents = list(doc.ents) + spans
        for span in spans:
            span.merge()
        return doc

nlp = English()
rest_countries = RESTCountriesComponent(nlp)
nlp.add_pipe(rest_countries)
nlp.to_disk('myNlp')
I then attempt to load my model with:
import spacy
nlp = spacy.load('myNlp')
But I get this error message:
KeyError: u"[E002] Can't find factory for 'countries'. This usually happens when spaCy calls nlp.create_pipe with a component name that's not built in - for example, when constructing the pipeline from a model's meta.json. If you're using a custom component, you can write to Language.factories['countries'] or remove it from the model meta and add it via nlp.add_pipe instead."
I can't just add my custom components to a generic pipeline in my programming environment. How can I do what I'm trying to do?
When you save out your model, spaCy will serialize all data and store a reference to your pipeline in the model's meta.json. For example: ["ner", "countries"]. When you load your model back in, spaCy will check the meta and initialise each pipeline component by looking it up in the so-called "factories": functions that tell spaCy how to construct a pipeline component. (The reason for this is that you usually don't want your model to store and eval arbitrary code when you load it back in – at least not by default.)
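For concreteness, the relevant part of such a meta.json might look like the minimal sketch below (real files contain more fields, e.g. the model name and version; only the pipeline list matters for this error):

```json
{
  "lang": "en",
  "pipeline": ["ner", "countries"]
}
```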
In your case, spaCy is trying to look up the component name 'countries' in the factories and fails, because it's not built in. Language.factories is a simple dictionary, though, so you can customise it and add your own entries:
from spacy.language import Language
Language.factories['countries'] = lambda nlp, **cfg: RESTCountriesComponent(nlp, **cfg)
A factory is a function that receives the shared nlp object and optional keyword arguments (config parameters). It then initialises the component and returns it. If you add the above code before you load your model, it should load as expected.
If you want this taken care of automatically, you could also ship your component with your model. This requires wrapping it as a Python package using the spacy package command, which creates all required Python files. By default, the __init__.py only includes a function to load your model – but you can also add custom functions to it or use it to add entries to spaCy's factories.
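As an illustration, a sketch of what such a package __init__.py could look like. The module name .countries and the RESTCountriesComponent import are assumptions – they depend on how you lay out your package – and the spaCy imports are deferred into the function so the sketch stands on its own:

```python
# Hypothetical __init__.py for a packaged spaCy v2 model.

def load(**overrides):
    """Register the custom factory, then load the model data
    shipped alongside this file (v2-style API)."""
    from spacy.language import Language
    from spacy.util import load_model_from_init_py
    from .countries import RESTCountriesComponent  # assumed module layout

    # Make the 'countries' factory known before the pipeline is built.
    Language.factories['countries'] = (
        lambda nlp, **cfg: RESTCountriesComponent(nlp, **cfg)
    )
    return load_model_from_init_py(__file__, **overrides)
```

With this in place, spacy.load on the installed package resolves 'countries' without any extra setup in your own code.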
As of v2.1.0 (currently available as a nightly version for testing), spaCy will also support providing pipeline component factories via Python entry points. This is especially useful for production setups and/or if you want to modularise your individual components and split them into their own packages. For example, you could create a Python package for your countries component and its factory, upload it to PyPI, version it and test it separately. In its setup.py, your package can define the spaCy factories it exposes and where to find them. spaCy will be able to detect them automatically – all you need to do is install the package in the same environment. Your model package could even require your component package as a dependency so it's installed automatically when you install your model.
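As a sketch, such a setup.py might declare the factory via the spacy_factories entry-point group. The package name, module name and factory object here are assumptions, not part of any existing package:

```python
from setuptools import setup

setup(
    name='countries-component',          # assumed package name
    version='0.0.1',
    py_modules=['countries_component'],  # assumed module containing the factory
    entry_points={
        # spaCy v2.1+ scans this entry-point group for factories;
        # each entry maps a factory name to a 'module:object' path.
        'spacy_factories': [
            'countries = countries_component:CountriesFactory',
        ],
    },
)
```

Once a package like this is installed in the same environment, spaCy can resolve the 'countries' name without you touching Language.factories yourself.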
This same issue came up for me, and these are the steps I used:
- run python setup.py sdist in this directory, and
- edit the __init__.py file of the package as instructed above.