I would like to know what the difference is between a Token and a Span in spaCy.
Also, what is the main reason we have to work with a Span? Why can't we simply use a Token to do any NLP, especially when we use spaCy's Matcher?
Brief background: my problem came up when I wanted to get the index of a span (its exact character index in the document string, not its token index in the spaCy Doc) after using spaCy's Matcher, which returns match_id, start and end — so from this information I could get a Span, not a Token. I then needed to create training data, which requires the exact index of a word in a sentence. If I had access to a Token, I could simply use token.idx, but Span does not have that! So I have to write extra code to find the index of the word (which is the same as the span) in its sentence.
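A minimal sketch of what I mean (the pattern and the text here are just examples I made up):

import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en")
matcher = Matcher(nlp.vocab)
matcher.add(1, None, [{"LOWER": "world"}])
doc = nlp("Hello world!")
for match_id, start, end in matcher(doc):
    span = doc[start:end]  # start/end are token indices, not character indices
    print(span.text)       # "world" -- but where does it start in doc.text?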
Token vs Span
From spaCy's documentation, a Token represents "a single word, punctuation symbol, whitespace, etc. from a document", while a Span is "a slice from the document". In other words, a Span is an ordered sequence of Tokens.
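For example, a minimal sketch (the sentence is arbitrary):

import spacy
nlp = spacy.load("en")
doc = nlp("Hello world!")
token = doc[0]   # a single Token: "Hello"
span = doc[0:2]  # a Span: the Token sequence "Hello world"
print(type(token))  # <class 'spacy.tokens.token.Token'>
print(type(span))   # <class 'spacy.tokens.span.Span'>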
Why Spans?
spaCy's Matcher gives Span-level information rather than Token-level, because it allows a sequence of Tokens to be matched. While a Span can be composed of just one Token, it isn't necessarily.
Consider the following example, where we match for the Token "hello" on its own, the Token "world" on its own, and the Span composed of the Tokens "hello" and "world".
>>> import spacy
>>> nlp = spacy.load("en")
>>> from spacy.matcher import Matcher
>>> matcher = Matcher(nlp.vocab)
>>> matcher.add(1, None, [{"LOWER": "hello"}])
>>> matcher.add(2, None, [{"LOWER": "world"}])
>>> matcher.add(3, None, [{"LOWER": "hello"}, {"LOWER": "world"}])
For "Hello world!"
all of these patterns match:
>>> document = nlp("Hello world!")
>>> [(token.idx, token) for token in document]
[(0, Hello), (6, world), (11, !)]
>>> matcher(document)
[(1, 0, 1), (3, 0, 2), (2, 1, 2)]
However, the 3rd pattern doesn't match for "Hello, world!", since "Hello" and "world" aren't contiguous Tokens (because of the "," Token), so they don't form a Span:
>>> document = nlp("Hello, world!")
>>> [(token.idx, token) for token in document]
[(0, Hello), (5, ,), (7, world), (12, !)]
>>> matcher(document)
[(1, 0, 1), (2, 2, 3)]
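To tie the matcher output back to the text, each (match_id, start, end) tuple can be sliced out of the Doc as a Span — a sketch, continuing the session above:

>>> for match_id, start, end in matcher(document):
...     span = document[start:end]
...     print(match_id, span.text, span[0].idx)
...
1 Hello 0
2 world 7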
Tokens from Spans
Despite this, you should be able to get Token-level information from the Span by iterating over it, the same way you could iterate over the Tokens in a Doc.
>>> document = nlp("Hello, world!")
>>> span = document[0:3]
>>> span, type(span)
(Hello, world, <class 'spacy.tokens.span.Span'>)
>>> [(token.idx, token, type(token)) for token in span]
[(0, Hello, <class 'spacy.tokens.token.Token'>), (5, ,, <class 'spacy.tokens.token.Token'>), (7, world, <class 'spacy.tokens.token.Token'>)]
You can access the tokens within a Span as if it were a list:
import spacy
nlp = spacy.load('en')
text = "This is a sentence."
doc = nlp(text)
span = doc[2:4]  # the Span "a sentence" (tokens at indices 2 and 3)
span_char_start = span[0].idx  # character offset of the span's first token
span_char_end = span[-1].idx + len(span[-1].text)  # offset just past its last token
assert text[span_char_start:span_char_end] == "a sentence"
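Depending on your spaCy version, a Span may also expose start_char and end_char attributes directly, which would replace the manual arithmetic above — a sketch under that assumption:

span = doc[2:4]
# start_char/end_char: the Span's character offsets into doc.text
assert text[span.start_char:span.end_char] == "a sentence"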