
spaCy SPECIAL-1 token overriding suffix rule causing annotation misalignment

My text looks like this:

'Laboratories, Inc.'

which gets tokenized as:

Laboratories     TOKEN
,    SUFFIX
Inc.     SPECIAL-1

However, annotations usually don't include trailing characters such as '.'.

So I tried adding a suffix rule to tokenize the '.'

(r'[.]+$',) 

But it does not work for strings like 'Inc.' or 'St.', which are tagged as SPECIAL-1. Tokenization issues like this cause a substantial number of annotations to be ignored because of misalignment, which significantly reduces the number of usable examples during training.
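To make the misalignment concrete, here is a minimal sketch. It assumes spaCy v3 and a blank English pipeline created with `spacy.blank("en")`, which may differ from the pipeline actually in use; the character offsets stand in for a hypothetical annotation that stops before the final period.

```python
import spacy

# Assumption: a blank English pipeline with the default tokenizer rules.
nlp = spacy.blank("en")
text = "Laboratories, Inc."
doc = nlp(text)

# Show which tokenizer rule produced each token.
for rule, token_text in nlp.tokenizer.explain(text):
    print(f"{token_text}\t{rule}")

# Hypothetical annotation covering "Laboratories, Inc" without the period.
start, end = 0, len("Laboratories, Inc")
span = doc.char_span(start, end, label="ORG")

# char_span returns None when the offsets do not fall on token boundaries.
print(span)
```

Because 'Inc.' stays a single token, the annotation ends in the middle of it and no span can be built from those offsets.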

Any suggestions are appreciated.

erotavlas asked Sep 02 '25 10:09


1 Answer

Tokenizer exceptions (also: special cases, rules) have priority over the other patterns, so you would need to remove the special cases you don't want.
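A small sketch of that priority, assuming spaCy v3 and a blank English pipeline (not necessarily the asker's setup). It adds the suffix pattern from the question on top of the default suffixes and compares a string that has a special case with one that does not; 'Labs.' is just an illustrative string chosen here.

```python
import spacy
from spacy.util import compile_suffix_regex

# Assumption: a blank English pipeline with the default tokenizer rules.
nlp = spacy.blank("en")

# Add the suffix pattern from the question to the default suffixes.
suffixes = list(nlp.Defaults.suffixes) + [r"[.]+$"]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

# "Inc." matches a special case, so the suffix rules are never consulted.
tokens_special = [t.text for t in nlp("Inc.")]

# "Labs." (illustrative) has no special case, so the period is split off.
tokens_plain = [t.text for t in nlp("Labs.")]

print(tokens_special)
print(tokens_plain)
```

The suffix rule is applied only to strings that do not match a special case, which is why adding more suffix patterns cannot change how 'Inc.' is tokenized.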

nlp.tokenizer.rules contains the special cases, and you can modify it. As an example, this removes all exceptions that contain a period:

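# Keep only the special cases whose text contains no period.
new_rules = {}
for orth, exc in nlp.tokenizer.rules.items():
    if "." not in orth:
        new_rules[orth] = exc
# Assigning to `rules` replaces the tokenizer's special cases.
nlp.tokenizer.rules = new_rules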
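Dropping every exception that contains a period also affects abbreviations such as 'e.g.'. If that is broader than needed, a narrower variant is possible. The sketch below assumes spaCy v3 and a blank English pipeline; the `unwanted` set is a hypothetical choice, to be replaced with whichever special cases clash with your annotations.

```python
import spacy

# Assumption: a blank English pipeline with the default tokenizer rules.
nlp = spacy.blank("en")

# Hypothetical choice: the special cases that clash with the annotations.
unwanted = {"Inc.", "St."}

# Rebuild the rules without those entries; everything else is kept.
nlp.tokenizer.rules = {
    orth: exc for orth, exc in nlp.tokenizer.rules.items() if orth not in unwanted
}

text = "Laboratories, Inc."
doc = nlp(text)
tokens = [t.text for t in doc]
print(tokens)

# Check whether offsets that stop before the final period now form a span.
span = doc.char_span(0, len("Laboratories, Inc"), label="ORG")
print(span)

# Special cases that were not removed still apply.
kept = [t.text for t in nlp("e.g.")]
print(kept)
```

With the special case gone, the default suffix rules split the trailing period from 'Inc.', so no extra suffix pattern should be needed for this example.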
aab answered Sep 04 '25 22:09


