
spaCy parenthesis tokenization: pairs of (LRB, RRB) not tokenized correctly

Tags: python, spacy

When an RRB (closing parenthesis) is not separated by a space from the word that follows it, it is tokenized as part of that word.

In [34]: nlp("Indonesia (CNN)AirAsia ")                                                               
Out[34]: Indonesia (CNN)AirAsia 

In [35]: d=nlp("Indonesia (CNN)AirAsia ")                                                             

In [36]: [(t.text, t.lemma_, t.pos_, t.tag_) for t in d]                                              
Out[36]: 
[('Indonesia', 'Indonesia', 'PROPN', 'NNP'),
 ('(', '(', 'PUNCT', '-LRB-'),
 ('CNN)AirAsia', 'CNN)AirAsia', 'PROPN', 'NNP')]

In [39]: d=nlp("(CNN)Police")                                                                         

In [40]: [(t.text, t.lemma_, t.pos_, t.tag_) for t in d]                                              
Out[40]: [('(', '(', 'PUNCT', '-LRB-'), ('CNN)Police', 'cnn)police', 'VERB', 'VB')]

The expected result, as obtained when the space is present, is:

In [37]: d=nlp("(CNN) Police")                                                                        

In [38]: [(t.text, t.lemma_, t.pos_, t.tag_) for t in d]                                              
Out[38]: 
[('(', '(', 'PUNCT', '-LRB-'),
 ('CNN', 'CNN', 'PROPN', 'NNP'),
 (')', ')', 'PUNCT', '-RRB-'),
 ('Police', 'Police', 'NOUN', 'NNS')]

Is this a bug? Any suggestions to fix the issue?

asked by Vimos

1 Answer

Use a custom tokenizer that adds the rule r'\b\)\b' to the infixes. The regex matches a ) that is both preceded and followed by a word character (a letter, digit, _, or, in Python 3, other Unicode word characters).

You may customize this regex further; it depends on the contexts in which you want the ) to be split off.
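To see exactly where the rule fires, you can test the pattern on its own with Python's re module (a quick standalone illustration, not part of the spaCy fix):

import re

# A ')' counts as an infix only when word characters sit on both sides of it.
infix = r"\b\)\b"

print(re.findall(infix, "(CNN)Police"))   # [')'] -> the tokenizer will split here
print(re.findall(infix, "(CNN) Police"))  # []    -> already separated, nothing to split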

See the full Python demo:

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex

nlp = spacy.load('en_core_web_sm')

def custom_tokenizer(nlp):
    # Prepend the new rule so ')' is split whenever it sits between word chars.
    infixes = (r"\b\)\b",) + tuple(nlp.Defaults.infixes)
    infix_re = compile_infix_regex(infixes)
    # Keep the default prefix/suffix behavior unchanged.
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                suffix_search=suffix_re.search,
                                infix_finditer=infix_re.finditer,
                                token_match=nlp.tokenizer.token_match,
                                rules=nlp.Defaults.tokenizer_exceptions)

nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("Indonesia (CNN)AirAsia ")

print([(t.text, t.lemma_, t.pos_, t.tag_) for t in doc])

Output:

[('Indonesia', 'Indonesia', 'PROPN', 'NNP'), ('(', '(', 'PUNCT', '-LRB-'), ('CNN', 'CNN', 'PROPN', 'NNP'), (')', ')', 'PUNCT', '-RRB-'), ('AirAsia', 'AirAsia', 'PROPN', 'NNP')]
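On newer spaCy versions (2.2+ and 3.x) you should not need to rebuild the whole Tokenizer: the tokenizer's infix_finditer attribute is writable, so appending the rule to the defaults is enough. A minimal sketch, assuming the writable tokenizer attributes described in spaCy's tokenizer customization docs:

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")

# Add the new rule to the default infix patterns and swap in the new finditer.
infixes = list(nlp.Defaults.infixes) + [r"\b\)\b"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("Indonesia (CNN)AirAsia")])
# ['Indonesia', '(', 'CNN', ')', 'AirAsia']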
answered by Wiktor Stribiżew
