
How to get the index of a token in a sentence in spaCy?

Is there an elegant way to get the index of a word/token within its sentence? I am aware of the token attributes (https://spacy.io/api/token#attributes): the i attribute returns the index within the whole parent document, but the parent document can contain multiple sentences.

Example:

"This is an example. This is another example."

What I need is for both occurrences of "This" to be returned as index 0, both occurrences of "is" as index 1, and so on.

Johannes Krämer asked Jun 07 '18 13:06



1 Answer

A spaCy Doc object also lets you iterate over doc.sents, which yields a Span object for each sentence. A span's start and end attributes give its token boundaries in the parent document. So if you iterate over the sentences and subtract the sentence's start index from token.i, you get the token's index relative to its sentence:

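# token.i is the token's index in the whole Doc; sent.start is the Doc index
# of the sentence's first token, so the difference is the in-sentence index.
for sent in doc.sents:
    for token in sent:
        print(token.text, token.i - sent.start)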

The default sentence segmentation uses the dependency parse, which is usually more accurate than a purely rule-based approach. However, you can also plug in a rule-based or entirely custom solution (see the spaCy documentation on sentence segmentation for details).
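For a self-contained illustration, here is a minimal sketch applied to the example from the question. It assumes spaCy v3 (where the string form of `nlp.add_pipe` is available) and uses a blank English pipeline with the rule-based `sentencizer` so that no trained model needs to be downloaded. The helper name `sentence_relative_indices` is hypothetical and not part of spaCy.

```python
import spacy


def sentence_relative_indices(doc):
    """Return one list per sentence of (token text, index within that sentence).

    Hypothetical helper: it only wraps the `token.i - sent.start` subtraction.
    """
    return [
        [(token.text, token.i - sent.start) for token in sent]
        for sent in doc.sents
    ]


# Assumption: spaCy v3. A blank pipeline has no parser, so the rule-based
# sentencizer is added to set the sentence boundaries that doc.sents needs.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("This is an example. This is another example.")
result = sentence_relative_indices(doc)

for sentence in result:
    print(sentence)
```

With a trained pipeline that includes a parser, the same helper should work unchanged, since it relies only on `doc.sents`, `token.i` and `sent.start`.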

Ines Montani answered Jan 03 '23 17:01
