I'm aware of the basic spaCy workflow for getting various attributes from a document, but I can't find a built-in function that returns the position (start/end) of a word which is part of a sentence.
Would anyone know if this is possible with spaCy?
When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The pipeline used by the trained pipelines typically includes a tagger, a lemmatizer, a parser and an entity recognizer.
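As a quick sanity check you can list the components in a loaded pipeline (a minimal sketch, assuming the en_core_web_sm model is installed; the exact components vary by model and version):

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> nlp.pipe_names  # names and order depend on the model
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']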
In spaCy, the process of tokenizing a text into segments of words and punctuation is done in several steps, processing the text from left to right. First, the tokenizer splits the text on whitespace, similar to the split() method. Then the tokenizer checks whether each substring matches a tokenizer exception rule.
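For example (loosely following the tokenization example in the spaCy documentation; the exact splits depend on the model's rules):

>>> doc = nlp("Let's go to N.Y.!")
>>> [token.text for token in doc]
['Let', "'s", 'go', 'to', 'N.Y.', '!']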
The attribute ruler lets you set token attributes for tokens identified by Matcher patterns. The attribute ruler is typically used to handle exceptions for token attributes and to map values between attributes, such as mapping fine-grained POS tags to coarse-grained POS tags. See the usage guide for examples.
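A rough sketch of what that looks like, along the lines of the example in the spaCy usage guide (the pattern here is purely illustrative):

>>> ruler = nlp.get_pipe('attribute_ruler')
>>> # tag both tokens of "The Who" as a proper noun instead of determiner + pronoun
>>> patterns = [[{'LOWER': 'the'}, {'TEXT': 'Who'}]]
>>> ruler.add(patterns=patterns, attrs={'TAG': 'NNP', 'POS': 'PROPN'}, index=0)
>>> ruler.add(patterns=patterns, attrs={'TAG': 'NNP', 'POS': 'PROPN'}, index=1)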
For a dependency parse, the relevant token attributes are:
Text: the original token text.
Dep: the syntactic relation connecting child to head.
Head text: the original text of the token head.
Head POS: the part-of-speech tag of the token head.
Children: the immediate syntactic dependents of the token.
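Those attributes can be printed with something like the following (adapted from the spaCy dependency-parse documentation; the example sentence is only illustrative):

>>> doc = nlp('Autonomous cars shift insurance liability toward manufacturers')
>>> for token in doc:
...     print(token.text, token.dep_, token.head.text, token.head.pos_,
...           [child.text for child in token.children])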
These are available as attributes of the tokens in the sentence. The Token documentation says:
idx (int): The character offset of the token within the parent document.
i (int): The index of the token within the parent document.
>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> parsed_sentence = nlp('This is my sentence')
>>> [(token.text, token.i) for token in parsed_sentence]
[('This', 0), ('is', 1), ('my', 2), ('sentence', 3)]
>>> [(token.text, token.idx) for token in parsed_sentence]
[('This', 0), ('is', 5), ('my', 8), ('sentence', 11)]
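Note that token.i and token.idx are offsets into the whole Doc. If you need the position of a token relative to the sentence it belongs to, one possible approach (a sketch, assuming a pipeline with sentence segmentation, e.g. the parser) is to subtract the start offsets of the containing sentence:

>>> doc = nlp('This is my sentence. Here is another one.')
>>> for token in doc:
...     sent = token.sent  # the sentence span containing this token
...     print(token.text, token.i - sent.start, token.idx - sent.start_char)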