
How to get the character indices of each sentence's beginning and end in spaCy?

Tags:

python

nlp

spacy

I am new to spaCy. I need the character indices at which each sentence starts and ends in the text. If I use doc.sents, I get the sentences, but sent.start and sent.end give the token indices, and I want the character indices.

for sent in doc.sents:
    print(sent.start, sent.end)  # prints token indices

Example:

import spacy

completeText = "Hi, I am using StackOverflow. The community is great."
nlp = spacy.load('en_core_web_sm')
# spaCy v2 API; in spaCy v3 this is nlp.add_pipe('sentencizer')
nlp.add_pipe(nlp.create_pipe('sentencizer'))
doc = nlp(completeText)
for sent in doc.sents:
    print(sent.start, sent.end)  # prints 0 7 and 7 12, the token indices

The print statement above prints only token indices, not character indices. My desired output is 0,29 and 30,53.

I have tried to compute the offsets from the length of each sentence, as below. The if statement at the end is there because the space after a full stop is not included in the sentence.

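sents = list(doc.sents)
start = [0] * len(sents)
end = [0] * len(sents)
length = 0  # running character offset into completeText
for index, sent in enumerate(sents):
    if index != 0:
        # assumes exactly one separator character between sentences
        start[index] = end[index - 1] + 1

    length += len(str(sent))
    end[index] = length

    # skip the space after the full stop, which is not part of the sentence
    if length < len(completeText) and completeText[length] == " ":
        length += 1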

This works when each full stop is followed by a single space. But on my complete text (more than 10,000 lines) I do not get the correct answer. Besides that space, does spaCy leave out any other characters when it builds the sentences?

Is there a better way to do this?

eoeo asked Oct 31 '25 09:10



1 Answer

You can just use start_char and end_char.

for sent in doc.sents:
    print(sent.start_char, sent.end_char)

A sentence is a Span in spaCy and comes with many useful attributes, which are covered in the docs.
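To make this concrete, here is a minimal sketch using the text from the question. It assumes spaCy v3 (where the sentencizer is added by name) and a blank English pipeline so no model download is needed; the helper name sentence_char_spans is made up for this example and is not part of spaCy.

```python
def sentence_char_spans(doc):
    """Return a (start_char, end_char) pair for every sentence in a spaCy Doc.

    Hypothetical helper for illustration; it only reads the start_char and
    end_char attributes that each sentence Span provides.
    """
    return [(sent.start_char, sent.end_char) for sent in doc.sents]


def main():
    import spacy  # assumes spaCy v3 is installed

    text = "Hi, I am using StackOverflow. The community is great."
    nlp = spacy.blank("en")      # tokenizer only, no trained model
    nlp.add_pipe("sentencizer")  # rule-based sentence splitting (v3 syntax)
    doc = nlp(text)

    for start, end in sentence_char_spans(doc):
        # end_char is exclusive, so slicing the text gives the sentence back
        print(start, end, repr(text[start:end]))
    # prints:
    # 0 29 'Hi, I am using StackOverflow.'
    # 30 53 'The community is great.'


if __name__ == "__main__":
    main()
```

Because the offsets come straight from the Doc, they stay correct even when the gap between two sentences is not a single space (a newline or repeated spaces, for example), which is a likely reason why adding up sentence lengths drifts on a longer text.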

polm23 answered Nov 02 '25 22:11
