What is the difference between token and span (a slice from a doc) in spaCy?

I would like to know the difference between a Token and a Span in spaCy.

Also, what is the main reason we have to work with a Span? Why can't we simply use a Token for any NLP task, especially when we use the spaCy Matcher?

Brief background: my problem came up when I wanted to get the index of a span (its exact character index in the document string, not its ordered index in the spaCy Doc) after using the spaCy Matcher, which returns 'match_id', 'start' and 'end'. From that information I could get a Span, not a Token. I then needed to create training data, which requires the exact index of a word in a sentence. If I had access to a Token, I could simply use token.idx, but a Span does not have that! So I had to write extra code to find the index of the word (which is the same as the span) in its sentence.

asked Nov 15 '19 by pedrum

2 Answers

Token vs Span

From spaCy's documentation, a Token represents a single word, punctuation symbol, whitespace, etc. from a document, while a Span is a slice from the document. In other words, a Span is an ordered sequence of Tokens.
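This relationship shows up directly when indexing a Doc: a single index yields a Token, while a slice yields a Span. A minimal sketch (using a blank tokenizer-only pipeline so no model download is needed; a full loaded model behaves the same way here):

```python
import spacy

# spacy.blank("en") builds a tokenizer-only pipeline, so no model is required.
nlp = spacy.blank("en")
doc = nlp("Hello world!")

token = doc[0]    # indexing a Doc yields a single Token
span = doc[0:2]   # slicing a Doc yields a Span (an ordered sequence of Tokens)

print(type(token))  # <class 'spacy.tokens.token.Token'>
print(type(span))   # <class 'spacy.tokens.span.Span'>
print(span.text)    # Hello world
```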

Why Spans?

spaCy's Matcher gives Span-level information rather than Token-level, because it allows a sequence of Tokens to be matched. While a Span can be composed of just one Token, that isn't necessarily the case.

Consider the following example, where we match the Token "hello" on its own, the Token "world" on its own, and the Span composed of the Tokens "hello" and "world":

>>> import spacy
>>> nlp = spacy.load("en")  # spaCy 2.x model shortcut; newer versions use e.g. "en_core_web_sm"
>>> from spacy.matcher import Matcher
>>> matcher = Matcher(nlp.vocab)
>>> matcher.add(1, None, [{"LOWER": "hello"}])  # spaCy 2.x signature; v3 is matcher.add(key, [pattern])
>>> matcher.add(2, None, [{"LOWER": "world"}])
>>> matcher.add(3, None, [{"LOWER": "hello"}, {"LOWER": "world"}])

For "Hello world!" all of these patterns match:

>>> document = nlp("Hello world!")
>>> [(token.idx, token) for token in document]
[(0, Hello), (6, world), (11, !)]
>>> matcher(document)
[(1, 0, 1), (3, 0, 2), (2, 1, 2)]

However, the 3rd pattern doesn't match for "Hello, world!", since "Hello" & "world" aren't contiguous Tokens (because of the "," Token), so they don't form a Span:

>>> document = nlp("Hello, world!")
>>> [(token.idx, token) for token in document]
[(0, Hello), (5, ,), (7, world), (12, !)]
>>> matcher(document)
[(1, 0, 1), (2, 2, 3)]
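As an aside (not part of the original answer), if you wanted the multi-token pattern to tolerate the comma, the Matcher lets you mark a pattern token as optional with the `OP` key. A sketch using the spaCy v3 `matcher.add` signature and a blank pipeline:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer-only pipeline is enough for lexical patterns
matcher = Matcher(nlp.vocab)

# "OP": "?" makes the punctuation Token optional, so the pattern matches
# both "hello world" and "hello, world" as a single contiguous Span.
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "world"}]
matcher.add("HELLO_WORLD", [pattern])  # spaCy v3 signature

doc = nlp("Hello, world!")
print(matcher(doc))  # one match spanning tokens 0..3 ("Hello", ",", "world")
```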

Accessing Tokens from Spans

That said, you can still get Token-level information from a Span by iterating over it, the same way you would iterate over the Tokens in a Doc.

>>> document = nlp("Hello, world!")
>>> span = document[0:3]  # the Span covering "Hello, world"
>>> span, type(span)
(Hello, world, <class 'spacy.tokens.span.Span'>)
>>> [(token.idx, token, type(token)) for token in span]
[(0, Hello, <class 'spacy.tokens.token.Token'>), (5, ,, <class 'spacy.tokens.token.Token'>), (7, world, <class 'spacy.tokens.token.Token'>)]
answered Oct 12 '22 by KurtMica

You can access the tokens within a span just like it's a list:

import spacy
nlp = spacy.load('en')  # older model shortcut; newer spaCy uses e.g. 'en_core_web_sm'
text = "This is a sentence."
doc = nlp(text)
span = doc[2:4]
span_char_start = span[0].idx                      # character offset of the span's first token
span_char_end = span[-1].idx + len(span[-1].text)  # end offset of its last token
assert text[span_char_start:span_char_end] == "a sentence"
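For what it's worth, a Span also carries its character offsets directly, via its `start_char` and `end_char` attributes, so the token-level arithmetic above can be avoided. A sketch using a blank pipeline (a loaded model works the same way):

```python
import spacy

nlp = spacy.blank("en")  # tokenizer-only pipeline; no model download needed
text = "This is a sentence."
doc = nlp(text)
span = doc[2:4]

# start_char/end_char are the Span's offsets into the original string.
print(span.start_char, span.end_char)        # 8 18
print(text[span.start_char:span.end_char])   # a sentence
```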
answered Oct 12 '22 by aab