My text looks like
'Laboratories, Inc.'
which gets tokenized as:
Laboratories TOKEN
, SUFFIX
Inc. SPECIAL-1
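(For reference, a breakdown like the one above can be produced with spaCy's tokenizer.explain; this sketch assumes a blank English pipeline, but the same works with any trained pipeline.)

```python
import spacy

# Show which tokenizer pattern produced each token.
nlp = spacy.blank("en")

pairs = nlp.tokenizer.explain("Laboratories, Inc.")
for pattern, text in pairs:
    print(pattern, text)
```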
However, my annotations usually don't include trailing characters like '.',
so I tried adding a suffix rule to split off the '.':
(r'[.]+$',)
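Concretely, the attempt looked something like this (a sketch using compile_suffix_regex on a blank English pipeline; the pipeline name is an assumption):

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.blank("en")  # assumed blank English pipeline

# Append the custom pattern to the default suffixes and rebuild the regex.
suffixes = list(nlp.Defaults.suffixes) + [r"[.]+$"]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

# 'Inc.' still comes out as a single token despite the new suffix rule.
tokens = [t.text for t in nlp("Laboratories, Inc.")]
print(tokens)
```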
But it does not work for strings like 'Inc.' or 'St.', which are tagged as SPECIAL-1. The problem is that tokenization issues like this cause a substantial number of annotations to be ignored due to misalignment, significantly reducing the number of usable examples during training.
Any suggestion is appreciated.
Tokenizer exceptions (also called special cases or rules) have priority over the other patterns, so you would need to remove the special cases you don't want.
nlp.tokenizer.rules
contains the special cases, which you can modify. For example, to remove all exceptions containing a period:
# Keep only the exceptions whose text contains no period
new_rules = {}
for orth, exc in nlp.tokenizer.rules.items():
    if "." not in orth:
        new_rules[orth] = exc
nlp.tokenizer.rules = new_rules
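Put together, a self-contained sketch looks like this (assuming a blank English pipeline; any pipeline with the default English tokenizer behaves the same). Once the exceptions are gone, the default suffix patterns split the trailing period:

```python
import spacy

nlp = spacy.blank("en")  # assumed blank English pipeline

before = [t.text for t in nlp("Laboratories, Inc.")]

# Drop every tokenizer exception whose text contains a period.
nlp.tokenizer.rules = {
    orth: exc for orth, exc in nlp.tokenizer.rules.items() if "." not in orth
}

after = [t.text for t in nlp("Laboratories, Inc.")]
print(before)
print(after)
```

Note that this also changes how genuine abbreviations like 'e.g.' are tokenized, so check your data before removing all period exceptions wholesale.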