Loading spacy models slows down running my unit tests. Is there a way to mock spacy models or Doc objects to speed up unit tests?
Example of a current slow test:
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def test_entities():
        text = u"Google is a company."
        doc = nlp(text)
        assert doc.ents[0].text == u"Google"
Based on the docs, my approach is to construct the Vocab and Doc manually and set the entities as tuples:
    from spacy.vocab import Vocab
    from spacy.tokens import Doc

    def test():
        alphanum_words = u"Google Facebook are companies".split(" ")
        labels = [u"ORG"]
        words = alphanum_words + [u"."]
        # Every token is followed by a space except the last two
        # ("companies" and ".").
        spaces = len(words) * [True]
        spaces[-1] = False
        spaces[-2] = False
        vocab = Vocab(strings=(alphanum_words + labels))
        doc = Doc(vocab, words=words, spaces=spaces)

        def get_hash(text):
            return vocab.strings[text]

        # Entities as (label hash, start token, end token) tuples.
        entity_tuples = tuple([(get_hash(labels[0]), 0, 1)])
        doc.ents = entity_tuples
        assert doc.ents[0].text == u"Google"
Is there a cleaner, more Pythonic way to mock spaCy objects in unit tests for entities?
This is a great question actually! I'd say your instinct is definitely right: If all you need is a Doc object in a given state and with given annotations, always create it manually wherever possible. And unless you're explicitly testing a statistical model, avoid loading it in your unit tests. It makes the tests slow, and it introduces too much unnecessary variance. This is also very much in line with the philosophy of unit testing: you want to be writing independent tests for one thing at a time (not one thing plus a bunch of third-party library code plus a statistical model).
Some general tips and ideas:
- Always construct a Doc manually. Avoid loading models or Language subclasses.
- Unless your application or test specifically needs the doc.text, you do not have to set the spaces. In fact, I leave this out in about 80% of the tests I write, because it really only becomes relevant when you're putting the tokens back together.
- If you need to create a lot of Doc objects in your test suite, you could consider using a utility function, similar to the get_doc helper we use in the spaCy test suite. (That function also shows you how the individual annotations are set manually, in case you need it.)
- Use (session-scoped) fixtures for shared objects, like the Vocab. Depending on what you're testing, you might want to explicitly use the English vocab. In the spaCy test suite, we do this by setting up an en_vocab fixture in the conftest.py.
- Instead of setting the doc.ents to a list of tuples, you can also make it a list of Span objects. This looks a bit more straightforward, is easier to read, and in spaCy v2.1+, you can also pass a string as a label:

    from spacy.tokens import Doc, Span

    def test_entities(en_vocab):
        doc = Doc(en_vocab, words=["Hello", "world"])
        doc.ents = [Span(doc, 0, 1, label="ORG")]
        assert doc.ents[0].text == "Hello"

- If you do need to test a model or a language class like English, put them in a session-scoped fixture. This means that they'll only be loaded once per session instead of once per test. Language classes are lazy-loaded and may also take some time to load, depending on the data they contain. So you only want to do this once.

    # Note: You probably don't have to do any of this, unless you're testing your
    # own custom models or language classes.
    import pytest
    import spacy

    @pytest.fixture(scope="session")
    def en_core_web_sm():
        return spacy.load("en_core_web_sm")

    @pytest.fixture(scope="session")
    def en_lang_class():
        lang_cls = spacy.util.get_lang_class("en")
        return lang_cls()

    def test(en_lang_class):
        doc = en_lang_class("Hello world")
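
The en_vocab fixture and the get_doc-style helper mentioned in the tips above could be sketched roughly as follows. This is only an illustration, not the code from the spaCy test suite: make_doc is a hypothetical name, the (start, end, label) tuple format is an assumption chosen for this example, and it assumes spaCy v2.1+ so that a string label can be passed to Span. The fixture would typically live in conftest.py.

```python
import pytest
from spacy.lang.en import English
from spacy.tokens import Doc, Span


@pytest.fixture(scope="session")
def en_vocab():
    # A blank English pipeline: no statistical model is loaded,
    # only the language data that ships with spaCy.
    return English().vocab


def make_doc(vocab, words, spaces=None, ents=()):
    """Hypothetical helper: build a Doc from words and entity offsets.

    ents is an iterable of (start, end, label) tuples in token indices
    (an assumed format for this sketch).
    """
    doc = Doc(vocab, words=words, spaces=spaces)
    if ents:
        doc.ents = [Span(doc, start, end, label=label) for start, end, label in ents]
    return doc


def test_entities(en_vocab):
    words = ["Google", "is", "a", "company", "."]
    doc = make_doc(en_vocab, words, ents=[(0, 1, "ORG")])
    assert doc.ents[0].text == "Google"
    assert doc.ents[0].label_ == "ORG"
```

The spaces argument is left optional here, in line with the tip above: it only needs to be passed when a test checks doc.text.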