
Sentence split using spaCy sentencizer

I am using spaCy's sentencizer to split the sentences.

from spacy.lang.en import English
nlp = English()
sbd = nlp.create_pipe('sentencizer')
nlp.add_pipe(sbd)

text="Please read the analysis. (You'll be amazed.)"
doc = nlp(text)

sents_list = []
for sent in doc.sents:
   sents_list.append(sent.text)

print(sents_list)
print([token.text for token in doc])

OUTPUT

['Please read the analysis. (', 
"You'll be amazed.)"]

['Please', 'read', 'the', 'analysis', '.', '(', 'You', "'ll", 'be', 
'amazed', '.', ')']

Tokenization is done correctly, but I am not sure why it splits the second sentence at the `(`, attaching the opening parenthesis to the end of the first sentence.
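(Side note: the snippet above uses the spaCy v2 pipeline API. In spaCy v3, `create_pipe` was removed and components are added to the pipeline by their registered string name instead. A minimal v3 sketch of the same sentencizer setup, using a simple two-sentence string for illustration:)

```python
from spacy.lang.en import English

nlp = English()
nlp.add_pipe("sentencizer")  # v3: add built-in components by name

doc = nlp("This is one sentence. This is another.")
print([sent.text for sent in doc.sents])
# → ['This is one sentence.', 'This is another.']
```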

asked Oct 16 '22 by piyush
1 Answer

I have tested the code below with the en_core_web_lg and en_core_web_sm models; performance with the sm model is similar to using the sentencizer (the lg model takes a performance hit).

The custom boundaries below only work as intended with the sm model; the lg model splits differently.

import spacy

nlp = spacy.load('en_core_web_sm')

def set_custom_boundaries(doc):
    for token in doc[:-1]:
        # Force a sentence break after ".(" or ")."
        if token.text == ".(" or token.text == ").":
            doc[token.i + 1].is_sent_start = True
        # Suppress a sentence break after "Rs." or ")"
        elif token.text == "Rs." or token.text == ")":
            doc[token.i + 1].is_sent_start = False
    return doc

# spaCy v2 API: pass the function itself, before the parser
nlp.add_pipe(set_custom_boundaries, before="parser")

text = "Please read the analysis. (You'll be amazed.)"
doc = nlp(text)

for sent in doc.sents:
    print(sent.text)
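(In spaCy v3, custom components must be registered with the `@Language.component` decorator and then added by name; passing a bare function to `add_pipe` no longer works. A minimal sketch of the same `is_sent_start` technique, with a hypothetical rule that splits on newline tokens and a blank pipeline so no trained model download is needed:)

```python
import spacy
from spacy.language import Language

# Hypothetical rule for illustration: start a new sentence after each newline.
@Language.component("newline_boundaries")
def newline_boundaries(doc):
    for token in doc[:-1]:
        if token.text == "\n":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.blank("en")             # blank pipeline, no trained model needed
nlp.add_pipe("sentencizer")         # set default boundaries first
nlp.add_pipe("newline_boundaries")  # then override them by name

doc = nlp("Line one\nLine two")
print([sent.text.strip() for sent in doc.sents])
```

The component runs after the sentencizer so its `is_sent_start` assignments override the defaults; a trained parser could not be overridden this way, which is why the original answer inserts its component `before="parser"`.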
answered Oct 21 '22 by piyush