Is there an elegant way to get the index of a word/token in its sentence?
I am aware of the attributes for tokens https://spacy.io/api/token#attributes
The i attribute returns the index within the whole parent document. But the parent document can contain multiple sentences.
Example:
"This is an example. This is another example."
What I need is for both "This" tokens to be returned as index 0, both "is" tokens as index 1, and so on.
A spaCy Doc object also lets you iterate over doc.sents, which yields Span objects for the individual sentences. To get a span's start and end index in the parent document, you can look at its start and end attributes. So if you iterate over the sentences and subtract the sentence's start index from token.i, you get the token's relative index within the sentence:
for sent in doc.sents:
    for token in sent:
        # token.i is the index in the whole doc; sent.start is the doc index of the sentence's first token
        print(token.text, token.i - sent.start)
The default sentence segmentation uses the dependency parse, which is usually more accurate than simple punctuation-based splitting, but it requires a pipeline with a parser. However, you can also plug in a rule-based or entirely custom solution (see here for details).
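As a minimal sketch of the rule-based route, the snippet below uses a blank English pipeline with spaCy's built-in sentencizer component, so no trained model has to be downloaded. It assumes spaCy v3 (where components are added by string name), and the helper name sentence_relative_indices is made up for illustration. Because the sentencizer splits on punctuation only, results on messier text may differ from the parser-based segmentation.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")
# Rule-based sentence splitter (spaCy v3 syntax); splits on punctuation
# rather than using the dependency parse.
nlp.add_pipe("sentencizer")

doc = nlp("This is an example. This is another example.")


def sentence_relative_indices(doc):
    """Hypothetical helper: one list per sentence of (token text, index in sentence)."""
    return [
        [(token.text, token.i - sent.start) for token in sent]
        for sent in doc.sents
    ]


indices = sentence_relative_indices(doc)
for sentence in indices:
    print(sentence)
```

With the example text from the question, both "This" tokens come out with index 0 and both "is" tokens with index 1. Note that the final period of each sentence is also a token and gets its own index.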