I am using spaCy's sentencizer to split the sentences.
from spacy.lang.en import English
nlp = English()
sbd = nlp.create_pipe('sentencizer')
nlp.add_pipe(sbd)
text="Please read the analysis. (You'll be amazed.)"
doc = nlp(text)
sents_list = []
for sent in doc.sents:
sents_list.append(sent.text)
print(sents_list)
print([token.text for token in doc])
OUTPUT
['Please read the analysis. (',
"You'll be amazed.)"]
['Please', 'read', 'the', 'analysis', '.', '(', 'You', "'ll", 'be',
'amazed', '.', ')']
Tokenization is done correctly but I am not sure it's not splitting the 2nd sentence along with ( and taking this as an end in the first sentence.
I have tested below code with en_core_web_lg and en_core_web_sm model and performance for sm model are similar to using sentencizer. (lg model will hit the performance).
Below custom boundaries only works with sm model and behave different splitting with lg model.
nlp=spacy.load('en_core_web_sm')
def set_custom_boundaries(doc):
for token in doc[:-1]:
if token.text == ".(" or token.text == ").":
doc[token.i+1].is_sent_start = True
elif token.text == "Rs." or token.text == ")":
doc[token.i+1].is_sent_start = False
return doc
nlp.add_pipe(set_custom_boundaries, before="parser")
doc = nlp(text)
for sent in doc.sents:
print(sent.text)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With