
Token extension versus matcher versus phrase matcher vs entity ruler in spaCy

I am trying to figure out the fastest way to extract entities, e.g. a month name. I have come up with five different approaches using spaCy.

Initial setup

For each solution I start with an initial setup

import spacy.lang.en    
nlp = spacy.lang.en.English()
text = 'I am trying to extract January as efficient as possible. But what is the best solution?'

Solution: using extension attributes (limited to single-token matching)

import spacy.tokens
NORM_EXCEPTIONS = {
    'jan': 'MONTH', 'january': 'MONTH'
}
spacy.tokens.Token.set_extension('norm', getter=lambda t: NORM_EXCEPTIONS.get(t.text.lower(), t.norm_))
def time_this():
    doc = nlp(text)
    assert [t for t in doc if t._.norm == 'MONTH'] == [doc[5]]

%timeit time_this()

76.4 µs ± 169 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Solution: using phrase matcher via entity ruler

import spacy.matcher
import spacy.pipeline
ruler = spacy.pipeline.EntityRuler(nlp)
# Override the ruler's default phrase matcher with one that matches on the lowercase form
ruler.phrase_matcher = spacy.matcher.PhraseMatcher(nlp.vocab, attr="LOWER")
ruler.add_patterns([{'label': 'MONTH', 'pattern': 'jan'}, {'label': 'MONTH', 'pattern': 'january'}])
nlp.add_pipe(ruler)
def time_this():
    doc = nlp(text)
    assert [t for t in doc.ents] == [doc[5:6]]
%timeit time_this()

131 µs ± 579 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Solution: using token matcher via entity ruler

import spacy.pipeline
ruler = spacy.pipeline.EntityRuler(nlp)
ruler.add_patterns([{'label': 'MONTH', 'pattern': [{'LOWER': {'IN': ['jan', 'january']}}]}])
nlp.add_pipe(ruler)
def time_this():
    doc = nlp(text)
    assert [t for t in doc.ents] == [doc[5:6]]
%timeit time_this()

72.6 µs ± 76.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Solution: using phrase matcher directly

import spacy.matcher
phrase_matcher = spacy.matcher.PhraseMatcher(nlp.vocab, attr="LOWER")
phrase_matcher.add('MONTH', None, nlp('jan'), nlp('january'))
def time_this():
    doc = nlp(text)
    matches = [m for m in phrase_matcher(doc) if m[0] == doc.vocab.strings['MONTH']]
    assert [doc[m[1]:m[2]] for m in matches] == [doc[5:6]]
%timeit time_this()

115 µs ± 537 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Solution: using token matcher directly

import spacy.matcher
matcher = spacy.matcher.Matcher(nlp.vocab)
matcher.add('MONTH', None, [{'LOWER': {'IN': ['jan', 'january']}}])
def time_this():
    doc = nlp(text)
    matches = [m for m in matcher(doc) if m[0] == doc.vocab.strings['MONTH']]
    assert [doc[m[1]:m[2]] for m in matches] == [doc[5:6]]
%timeit time_this()

55.5 µs ± 459 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Conclusion

The custom-attribute approach is limited to single-token matching, and the token matcher seems to be faster, so that seems preferable. The EntityRuler seems to be the slowest, which isn't surprising since it also modifies Doc.ents. It is, however, quite convenient to have your matches in Doc.ents, so you might still want to consider this method.

I was quite surprised that the token matcher outperforms the phrase matcher. I thought it would be the opposite:

If you need to match large terminology lists, you can also use the PhraseMatcher and create Doc objects instead of token patterns, which is much more efficient overall

Question

Am I missing something important here or can I trust this analysis on a larger scale?

asked Apr 25 '19 by mr.bjerre

1 Answer

I think ultimately, it all comes down to finding the optimal tradeoff between speed, maintainability of the code and the way this piece of logic fits into the larger picture of your application. Finding a few strings in a text is unlikely to be the end goal of what you're trying to do – otherwise, you probably wouldn't be using spaCy and would stick to regular expressions. How your application needs to "consume" the result of the matching and what the matches mean in the larger context should motivate the approach you choose.

As you mention in the conclusion, if your matches are "named entities" by definition, adding them to the doc.ents makes a lot of sense and will even give you an easy way to combine your logic with statistical predictions. Even if it adds slightly more overhead, it'll likely still outperform any scaffolding you'd otherwise have to write around it yourself.
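As a rough sketch of that combination (assuming the pretrained en_core_web_sm model is installed; the label and pattern are illustrative):

import spacy
import spacy.pipeline
nlp_stat = spacy.load('en_core_web_sm')  # pretrained pipeline, assumed installed
ruler = spacy.pipeline.EntityRuler(nlp_stat)
ruler.add_patterns([{'label': 'MONTH', 'pattern': 'january'}])
# Adding the ruler before the statistical NER means the entity recognizer
# will respect the spans the ruler has already set on doc.ents.
nlp_stat.add_pipe(ruler, before='ner')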

For each solution I start with an initial setup

If you're running the experiments in the same session, e.g. in a notebook, you may want to include the creation of the Doc object in your initial setup. Otherwise, the caching of the vocabulary entries could theoretically mean that the very first call of nlp(text) is slower than the subsequent calls. It's likely insignificant, though.
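For illustration, here's a variant of the setup (reusing nlp, text and the token matcher from the question) where the Doc is created once up front, so the timed function measures only the matching:

doc = nlp(text)  # created once in the setup, so tokenization and vocab caching happen here
def time_this():
    matches = [m for m in matcher(doc) if m[0] == doc.vocab.strings['MONTH']]
    assert [doc[m[1]:m[2]] for m in matches] == [doc[5:6]]
%timeit time_this()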

I was quite surprised that the token matcher outperforms the phrase matcher. I thought it would be the opposite

One potential explanation is that you're profiling the approaches on a very small scale and on single-token patterns where the phrase matcher engine doesn't really have an advantage over the regular token matcher. Another factor could be that matching on a different attribute (e.g. LOWER instead of TEXT/ORTH) requires creating a new Doc during matching that reflects the values of the matched attribute. This should be inexpensive, but it's still one extra object that gets created. So a test Doc "extract January" will actually become "extract january" (when matching on LOWER) or even "VERB PROPN" when matching on POS. That's the trick that makes matching on other attributes work.
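If you wanted to avoid that extra Doc at match time, one illustrative alternative (not necessarily faster in practice) is to keep the PhraseMatcher on its default ORTH attribute and enumerate the case variants yourself:

import spacy.matcher
# Default attr is ORTH (the verbatim text), so no lowercased copy of the Doc
# has to be created during matching; the case variants are listed explicitly.
orth_matcher = spacy.matcher.PhraseMatcher(nlp.vocab)
orth_matcher.add('MONTH', None, nlp('jan'), nlp('Jan'), nlp('january'), nlp('January'))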

Some background on how the PhraseMatcher works and why its mechanism is typically faster: When you add Doc objects to the PhraseMatcher, it sets flags on the tokens included in the patterns, indicating that they're matching a given pattern. It then calls into the regular Matcher and adds token-based patterns using the previously set flags. When you're matching, spaCy will only have to check the flags and not retrieve any token attributes – that's what should make the matching itself significantly faster at scale.

This actually brings up another approach you could be profiling for comparison: Using Vocab.add_flag to set a boolean flag on the respective lexeme (entry in the vocab, so not the context-sensitive token). Vocab entries are cached, so you should only have to compute the flag once for a lexeme like "january". However, this approach only really makes sense for single tokens, so it's relatively limiting.
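A minimal sketch of that idea, using Vocab.add_flag and Token.check_flag (the names MONTHS and IS_MONTH are illustrative):

MONTHS = {'jan', 'january'}
# The flag getter runs once per lexeme and the result is cached in the vocab,
# so repeated occurrences of 'january' don't recompute the set lookup.
IS_MONTH = nlp.vocab.add_flag(lambda text: text.lower() in MONTHS)
def time_this():
    doc = nlp(text)
    assert [t for t in doc if t.check_flag(IS_MONTH)] == [doc[5]]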

Am I missing something important here or can I trust this analysis on a larger scale?

If you want to get any meaningful insights, you should be benchmarking on at least a medium-sized scale. Instead of looping over the same small example 10000 times, benchmark on a dataset that you'll only be processing once per test – for instance, a few hundred documents similar to the data you're actually working with. There are caching effects (both within spaCy, but also your CPU), differences in memory allocation and so on that can all have an impact.
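For example, a rough single-pass benchmark could look like this (load_my_texts is a hypothetical stand-in for loading a few hundred realistic documents):

import time
texts = load_my_texts()  # hypothetical helper: a few hundred realistic documents
start = time.perf_counter()
for doc in nlp.pipe(texts):  # each document is processed exactly once
    matches = matcher(doc)
elapsed = time.perf_counter() - start
print(f'{elapsed:.3f}s for {len(texts)} docs')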

Finally, using spaCy's Cython API directly will always be the fastest. So if speed is your number one concern and all you want to optimise for, Cython would be the way to go.

answered by Ines Montani