
Sentence split using spaCy sentencizer

I am using spaCy's sentencizer to split the sentences.

from spacy.lang.en import English
nlp = English()
sbd = nlp.create_pipe('sentencizer')
nlp.add_pipe(sbd)

text="Please read the analysis. (You'll be amazed.)"
doc = nlp(text)

sents_list = []
for sent in doc.sents:
   sents_list.append(sent.text)

print(sents_list)
print([token.text for token in doc])

OUTPUT

['Please read the analysis. (', 
"You'll be amazed.)"]

['Please', 'read', 'the', 'analysis', '.', '(', 'You', "'ll", 'be', 
'amazed', '.', ')']

Tokenization is done correctly, but I am not sure why it splits the second sentence at the `(`, attaching the opening parenthesis to the end of the first sentence.
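(Side note: the snippet above uses the spaCy v2 pipeline API. In spaCy v3, `create_pipe` was removed and components are added to the pipeline by their registered string name instead. A minimal v3 sketch of the same sentencizer setup, using a simple two-sentence string for illustration:)

```python
from spacy.lang.en import English

nlp = English()
nlp.add_pipe("sentencizer")  # v3: add built-in components by name

doc = nlp("This is one sentence. This is another.")
print([sent.text for sent in doc.sents])
# → ['This is one sentence.', 'This is another.']
```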

asked Oct 16 '22 by piyush
1 Answer

I have tested the code below with the en_core_web_lg and en_core_web_sm models; performance with the sm model is similar to using the sentencizer (the lg model takes a performance hit).

The custom boundaries below only work as intended with the sm model; the lg model splits differently.

import spacy

nlp = spacy.load('en_core_web_sm')

def set_custom_boundaries(doc):
    for token in doc[:-1]:
        # Force a sentence break after ".(" or ")."
        if token.text == ".(" or token.text == ").":
            doc[token.i + 1].is_sent_start = True
        # Suppress a sentence break after "Rs." or ")"
        elif token.text == "Rs." or token.text == ")":
            doc[token.i + 1].is_sent_start = False
    return doc

# spaCy v2 API: pass the function itself, before the parser
nlp.add_pipe(set_custom_boundaries, before="parser")

text = "Please read the analysis. (You'll be amazed.)"
doc = nlp(text)

for sent in doc.sents:
    print(sent.text)
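(In spaCy v3, custom components must be registered with the `@Language.component` decorator and then added by name; passing a bare function to `add_pipe` no longer works. A minimal sketch of the same `is_sent_start` technique, with a hypothetical rule that splits on newline tokens and a blank pipeline so no trained model download is needed:)

```python
import spacy
from spacy.language import Language

# Hypothetical rule for illustration: start a new sentence after each newline.
@Language.component("newline_boundaries")
def newline_boundaries(doc):
    for token in doc[:-1]:
        if token.text == "\n":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.blank("en")             # blank pipeline, no trained model needed
nlp.add_pipe("sentencizer")         # set default boundaries first
nlp.add_pipe("newline_boundaries")  # then override them by name

doc = nlp("Line one\nLine two")
print([sent.text.strip() for sent in doc.sents])
```

The component runs after the sentencizer so its `is_sent_start` assignments override the defaults; a trained parser could not be overridden this way, which is why the original answer inserts its component `before="parser"`.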
answered Oct 21 '22 by piyush