
spaCy documentation for [orth, pos, tag, lemma and text]

I am new to spaCy. I am adding this post as documentation, to make things simpler for newcomers like me.

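import spacy

# 'en' is the shortcut name in use when this was asked; on spaCy 3 and
# later, load a named package such as 'en_core_web_sm' instead.
nlp = spacy.load('en')
doc = nlp(u'KEEP CALM because TOGETHER We Rock !')
for word in doc:
    print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_)
    print(word.orth_)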

I would like to understand what orth, lemma, tag and pos mean. This code prints out their values. Also, what is the difference between print(word) and print(word.orth_)?

asked May 16 '17 at 00:05 by ahmed osama



1 Answer

What is the meaning of orth, lemma, tag and pos?

See https://spacy.io/docs/usage/pos-tagging#pos-schemes

What is the difference between print(word) and print(word.orth_)?

In super short:

word.orth_ and word.text are the same. In spaCy, an attribute whose name ends with an underscore (orth_, lemma_, tag_, pos_) returns the string form of the value, while the same name without the underscore (orth, lemma, tag, pos) returns the corresponding integer ID.

In short:

When you access the word.orth_ property (https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L537), it uses the token's integer ID to look up the string in the store where all the vocabulary's strings are kept:

property orth_:
    def __get__(self):
        return self.vocab.strings[self.c.lex.orth]

(See "In long" below for an explanation of self.c.lex.orth.)

word.text returns the string representation of the word and merely wraps the orth_ property, see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L128

property text:
    def __get__(self):
        return self.orth_
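
To make the chain text -> orth_ -> vocab.strings[orth] concrete, here is a toy model in plain Python. It does not use spaCy: ToyStringStore and ToyToken are hypothetical stand-ins, and the sequential IDs are an assumption made for illustration only (spaCy assigns its IDs differently).

```python
class ToyStringStore:
    """Hypothetical stand-in for vocab.strings: maps strings <-> integer IDs."""

    def __init__(self):
        self._strings = []   # ID -> string
        self._ids = {}       # string -> ID

    def add(self, string):
        # Assumption: IDs are handed out sequentially; real spaCy differs.
        if string not in self._ids:
            self._ids[string] = len(self._strings)
            self._strings.append(string)
        return self._ids[string]

    def __getitem__(self, key):
        # An int key returns the string; a str key returns its ID.
        if isinstance(key, int):
            return self._strings[key]
        return self._ids[key]


class ToyToken:
    """Hypothetical stand-in for Token: stores only the integer ID."""

    def __init__(self, strings, word):
        self.strings = strings
        self.orth = strings.add(word)   # integer ID, like token.orth

    @property
    def orth_(self):
        # String form, looked up from the ID, like token.orth_
        return self.strings[self.orth]

    @property
    def text(self):
        # text simply delegates to orth_
        return self.orth_


strings = ToyStringStore()
tokens = [ToyToken(strings, w) for w in ["KEEP", "CALM", "KEEP"]]
for tok in tokens:
    print(tok.orth, tok.orth_, tok.text)
```

Both occurrences of "KEEP" share one ID, and the string is only ever stored once; the token itself carries just the integer.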

And when you call print(word), Python calls the __str__ dunder method, which returns word.__unicode__() on Python 3 or word.__bytes__() on Python 2. Both point back to word.text, and __repr__ simply delegates to __str__, see https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L55

cdef class Token:
    """
    An individual token --- i.e. a word, punctuation symbol, whitespace, etc.
    """
    def __cinit__(self, Vocab vocab, Doc doc, int offset):
        self.vocab = vocab
        self.doc = doc
        self.c = &self.doc.c[offset]
        self.i = offset

    def __hash__(self):
        return hash((self.doc, self.i))

    def __len__(self):
        """
        Number of unicode characters in token.text.
        """
        return self.c.lex.length

    def __unicode__(self):
        return self.text

    def __bytes__(self):
        return self.text.encode('utf8')

    def __str__(self):
        if is_config(python3=True):
            return self.__unicode__()
        return self.__bytes__()

    def __repr__(self):
        return self.__str__()
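
The dunder wiring in the excerpt above can be reproduced with a small plain-Python class. This is only a sketch: ToyToken is a hypothetical stand-in, and the Python 2 branch is left out.

```python
class ToyToken:
    """Hypothetical stand-in for Token with the same dunder wiring."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        # On Python 3 the original returns __unicode__(), i.e. self.text
        return self.text

    def __repr__(self):
        # __repr__ delegates to __str__, as in the excerpt above
        return self.__str__()


word = ToyToken("Rock")
print(word)      # print() uses str(word), i.e. __str__  -> Rock
print([word])    # a list uses repr() on its items       -> [Rock]
```

This is why a token shows up as its bare text both when printed directly and when it appears inside a list.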

In long:

Let's try to walk through this step by step:

>>> import spacy
>>> nlp = spacy.load('en')
>>> doc = nlp(u'This is a foo bar sentence.')
>>> type(doc)
<type 'spacy.tokens.doc.Doc'>

After the sentence is passed to nlp(), it produces a spacy.tokens.doc.Doc object. From the docstring:

cdef class Doc:
    """
    A sequence of `Token` objects. Access sentences and named entities,
    export annotations to numpy arrays, losslessly serialize to compressed
    binary strings.
    Aside: Internals
        The `Doc` object holds an array of `TokenC` structs.
        The Python-level `Token` and `Span` objects are views of this
        array, i.e. they don't own the data themselves.
    Code: Construction 1
        doc = nlp.tokenizer(u'Some text')
    Code: Construction 2
        doc = Doc(nlp.vocab, orths_and_spaces=[(u'Some', True), (u'text', True)])
    """

So the spacy.tokens.doc.Doc object is a sequence of spacy.tokens.token.Token objects. Within the Token class, we see a series of Cython properties, e.g. at https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162

property orth:
    def __get__(self):
        return self.c.lex.orth

Tracing it back, we see that self.c = &self.doc.c[offset]:

cdef class Token:
    """
    An individual token --- i.e. a word, punctuation symbol, whitespace, etc.
    """
    def __cinit__(self, Vocab vocab, Doc doc, int offset):
        self.vocab = vocab
        self.doc = doc
        self.c = &self.doc.c[offset]
        self.i = offset

Without thorough documentation, we don't really know what self.c means, but from the looks of it, it holds the address of one entry in the c array of the Doc that was passed into the __cinit__ function. So most probably it's a shortcut to the token's underlying data.

Looking at Doc.c:

cdef class Doc:
    def __init__(self, Vocab vocab, words=None, spaces=None, orths_and_spaces=None):
        self.vocab = vocab
        size = 20
        self.mem = Pool()
        # Guarantee self.lex[i-x], for any i >= 0 and x < padding is in bounds
        # However, we need to remember the true starting places, so that we can
        # realloc.
        data_start = <TokenC*>self.mem.alloc(size + (PADDING*2), sizeof(TokenC))
        cdef int i
        for i in range(size + (PADDING*2)):
            data_start[i].lex = &EMPTY_LEXEME
            data_start[i].l_edge = i
            data_start[i].r_edge = i
        self.c = data_start + PADDING

Now we see that Doc.c points into data_start, a Cython array of TokenC structs allocated from the memory pool to hold the token data of the spacy.tokens.doc.Doc object, shifted by PADDING entries (the <TokenC*> is a Cython cast of the allocated memory to a TokenC pointer).

So going back to self.c = &self.doc.c[offset], it takes the memory address of the "offset-th" item in that array.

That's what spacy.tokens.token.Token is.
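
The "token is a view into the Doc's array" idea can be sketched in plain Python. Everything here is a toy: ToyDoc, ToyToken and record() are hypothetical names, the records are dicts rather than TokenC structs, and the PADDING value is chosen for illustration only.

```python
PADDING = 5  # assumption: illustrative value only


class ToyDoc:
    """Hypothetical stand-in for Doc: owns a padded list of token records."""

    def __init__(self, words):
        # Empty slots on both sides, mirroring size + (PADDING*2)
        self._data = [{"orth_": ""} for _ in range(PADDING)]
        self._data += [{"orth_": w} for w in words]
        self._data += [{"orth_": ""} for _ in range(PADDING)]

    def record(self, offset):
        # Equivalent in spirit to (data_start + PADDING)[offset]
        return self._data[PADDING + offset]

    def __getitem__(self, offset):
        return ToyToken(self, offset)


class ToyToken:
    """A view: holds the doc and an offset, not a copy of the data."""

    def __init__(self, doc, offset):
        self.doc = doc
        self.i = offset

    @property
    def text(self):
        return self.doc.record(self.i)["orth_"]


doc = ToyDoc(["This", "is", "a", "foo", "bar", "sentence", "."])
tok = doc[3]
print(tok.text)                  # foo
doc.record(3)["orth_"] = "baz"   # change the underlying record
print(tok.text)                  # baz: the view reflects the change
```

Because of the padding, an offset slightly before the first token (for example doc.record(-1)) still lands on a valid empty record, which is what the "in bounds" comment in the Doc constructor is about.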


Going back to the property:

property orth:
    def __get__(self):
        return self.c.lex.orth

We see that self.c.lex accesses data_start[i].lex from the spacy.tokens.doc.Doc, and self.c.lex.orth is simply an integer ID that identifies the word's string in the vocabulary's string store (self.vocab.strings). The store belongs to the shared vocabulary, not to the individual Doc; in spaCy 2 and later the ID is a hash of the string.

Thus, we see that the orth_ property looks up self.vocab.strings with the ID from self.c.lex.orth https://github.com/explosion/spaCy/blob/develop/spacy/tokens/token.pyx#L162

property orth_:
    def __get__(self):
        return self.vocab.strings[self.c.lex.orth]

answered Nov 15 '22 at 18:11 by alvas