I am using GloVe embeddings and I am quite confused about tokens and vocab in the embeddings. Like this one:
Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)
What do tokens and vocab mean, respectively? What is the difference?
In NLP, tokens refers to the total number of "words" in your corpus. I put "words" in quotes because the definition varies by task. The vocab is the number of unique "words".
It should be the case that vocab <= tokens.
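For example, a minimal sketch with a toy corpus (using plain whitespace tokenization just for illustration; real corpora like Common Crawl use more careful tokenizers):

```python
# Toy corpus to show the difference between tokens and vocab
corpus = "the cat sat on the mat the cat slept"

tokens = corpus.split()   # every occurrence counts as a token
vocab = set(tokens)       # only unique words count toward the vocab

print(len(tokens))  # 9 -> total tokens in the corpus
print(len(vocab))   # 6 -> vocabulary size: the, cat, sat, on, mat, slept
```

So for the Common Crawl model above, the corpus contained 840 billion tokens in total, of which 2.2 million were unique, and each of those 2.2 million vocabulary entries gets a 300-dimensional vector.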