How to get the index of a token in a sentence in spaCy?

Question

Is there an elegant way to get the index of a word/token in its sentence? I am aware of the token attributes (https://spacy.io/api/token#attributes). The i attribute returns the index within the whole parent document, but the parent document can contain multiple sentences.

Example:

"This is an example. This is another example."

What I need is for both occurrences of "This" to be returned as index 0, both occurrences of "is" as index 1, and so on.

Ines Montani · Accepted Answer

A spaCy Doc object also lets you iterate over doc.sents, which yields a Span object for each individual sentence. To get a span's start and end index in the parent document, you can look at its start and end attributes. So if you iterate over the sentences and subtract the sentence's start index from token.i, you get the token's relative index within the sentence:

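import spacy
nlp = spacy.load("en_core_web_sm")  # assumed model; any pipeline that sets sentence boundaries works
doc = nlp("This is an example. This is another example.")
for sent in doc.sents:
    for token in sent:
        print(token.text, token.i - sent.start)  # index relative to the sentence start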

The default sentence segmentation uses the dependency parse, which is usually more accurate than a purely rule-based approach. However, you can also plug in a rule-based or entirely custom solution (see the spaCy documentation on sentence segmentation for details).
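As a rough illustration of the rule-based option, here is a minimal sketch that needs no trained model. It assumes spaCy v3, where the built-in rule-based component can be added by name with nlp.add_pipe("sentencizer"). The helper name sentence_relative_indices is made up for this example and is not part of spaCy.

```python
import spacy


def sentence_relative_indices(doc):
    """Return one list per sentence of (token text, index within that sentence).

    Hypothetical helper for illustration; it only relies on doc.sents,
    token.i and Span.start, which are described in the answer above.
    """
    return [
        [(token.text, token.i - sent.start) for token in sent]
        for sent in doc.sents
    ]


# Assumption: spaCy v3. A blank English pipeline only tokenizes, so the
# rule-based sentencizer is added to set sentence boundaries on punctuation.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("This is an example. This is another example.")
result = sentence_relative_indices(doc)

for sentence in result:
    print(sentence)
```

With the example text from the question, both occurrences of "This" should come out with index 0 and both occurrences of "is" with index 1, because each sentence's start offset is subtracted from the document-level token.i. Swapping the blank pipeline for a trained one with a parser should not require any change to the helper, since it only depends on doc.sents being available.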

Tags:

nlp

spacy

dependency-parsing

Johannes Krämer
