My text looks like
'Laboratories, Inc.'
which gets tokenized as:
Laboratories TOKEN
, SUFFIX
Inc. SPECIAL-1
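(For reference, a breakdown like the one above can be produced with spaCy's tokenizer.explain; this sketch assumes a blank English pipeline, but the same works with any trained pipeline.)

```python
import spacy

# Show which tokenizer pattern produced each token.
nlp = spacy.blank("en")

pairs = nlp.tokenizer.explain("Laboratories, Inc.")
for pattern, text in pairs:
    print(pattern, text)
```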
However, my annotations usually don't include trailing characters like '.',
so I tried adding a suffix rule to split off the '.':
(r'[.]+$',)
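Concretely, the attempt looked something like this (a sketch using compile_suffix_regex on a blank English pipeline; the pipeline name is an assumption):

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.blank("en")  # assumed blank English pipeline

# Append the custom pattern to the default suffixes and rebuild the regex.
suffixes = list(nlp.Defaults.suffixes) + [r"[.]+$"]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

# 'Inc.' still comes out as a single token despite the new suffix rule.
tokens = [t.text for t in nlp("Laboratories, Inc.")]
print(tokens)
```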
But it does not work for strings like 'Inc.' or 'St.', which are tagged as SPECIAL-1. The problem is that tokenization issues like this cause a substantial number of annotations to be ignored due to misalignment, significantly reducing the number of usable examples during training.
Any suggestion is appreciated.
Tokenizer exceptions (also called special cases or rules) have priority over the other patterns, so you would need to remove the special cases you don't want.
nlp.tokenizer.rules
contains the special cases, which you can modify. For example, to remove all exceptions containing a period:
# Keep only the exceptions whose text contains no period
new_rules = {}
for orth, exc in nlp.tokenizer.rules.items():
    if "." not in orth:
        new_rules[orth] = exc
nlp.tokenizer.rules = new_rules
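Put together, a self-contained sketch looks like this (assuming a blank English pipeline; any pipeline with the default English tokenizer behaves the same). Once the exceptions are gone, the default suffix patterns split the trailing period:

```python
import spacy

nlp = spacy.blank("en")  # assumed blank English pipeline

before = [t.text for t in nlp("Laboratories, Inc.")]

# Drop every tokenizer exception whose text contains a period.
nlp.tokenizer.rules = {
    orth: exc for orth, exc in nlp.tokenizer.rules.items() if "." not in orth
}

after = [t.text for t in nlp("Laboratories, Inc.")]
print(before)
print(after)
```

Note that this also changes how genuine abbreviations like 'e.g.' are tokenized, so check your data before removing all period exceptions wholesale.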