I am new to spaCy. I need to get the positions where each sentence starts and ends in the text. If I use doc.sents, I get the sentences, and sent.start and sent.end give the token indices, but I want the character indices.
for sent in doc.sents:
    print(sent.start, sent.end)  # prints token indices
Example:
import spacy

completeText = "Hi, I am using StackOverflow. The community is great."
nlp = spacy.load('en_core_web_sm')
# spaCy v2 syntax; spaCy v3 uses nlp.add_pipe('sentencizer')
nlp.add_pipe(nlp.create_pipe('sentencizer'))
doc = nlp(completeText)
for sent in doc.sents:
    print(sent.start, sent.end)  # prints 0,7 and 7,12, the token indices
The above print statement prints only token indices, not character indices. My desired output is 0,29 and 30,53.
I have tried tracking the length of each sentence as below. I added an if statement at the end because the space after a full stop is not included in the sentences.
sents = list(doc.sents)
start = [0] * len(sents)
end = [0] * len(sents)
length = 0  # running character offset into completeText
for index, sent in enumerate(sents):
    if index != 0:
        start[index] = end[index - 1] + 1
    length += len(str(sent))
    end[index] = length
    # skip the single space that follows the full stop, if there is one
    if end[index] < len(completeText) and completeText[end[index]] == " ":
        length += 1
This works fine when there is only a single space after a full stop. But on my complete text (more than 10,000 lines) I am not getting the correct answer. Does spaCy leave any other characters out of the sentences, like the space mentioned above?
Is there a better way to do this?
You can just use start_char and end_char.
for sent in doc.sents:
    print(sent.start_char, sent.end_char)
A sentence is a Span in spaCy and comes with many useful attributes, which are covered in the docs.
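As a rough, self-contained sketch of the same idea: the helper name sentence_char_spans is made up for illustration, and the pipeline here is an assumption (a blank English pipeline with the rule-based sentencizer, added with the spaCy v3 form of add_pipe) rather than the en_core_web_sm setup from the question. It collects the character offsets of each sentence together with the sentence text.

```python
import spacy


def sentence_char_spans(text):
    """Return (start_char, end_char, sentence_text) for each sentence in text."""
    # Assumption: blank English pipeline plus the rule-based sentencizer,
    # using the spaCy v3 string-name form of add_pipe.
    nlp = spacy.blank("en")
    nlp.add_pipe("sentencizer")
    doc = nlp(text)

    spans = []
    for sent in doc.sents:
        # start_char is inclusive and end_char is exclusive, so
        # text[sent.start_char:sent.end_char] gives back sent.text.
        spans.append((sent.start_char, sent.end_char, sent.text))
    return spans


if __name__ == "__main__":
    example = "Hi, I am using StackOverflow. The community is great."
    for start, end, sentence in sentence_char_spans(example):
        print(start, end, repr(sentence))
```

Because end_char is exclusive, the last sentence of the example ends at 53, which is the length of the text. The offsets come from the token positions in the original string, so no manual bookkeeping for whitespace between sentences is needed.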