I am using GloVe embeddings and I am quite confused about tokens and vocab in the embeddings. Like this one:
Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)
What do tokens and vocab mean, respectively? What is the difference?
In NLP, tokens refers to the total number of "words" in your corpus. I put words in quotes because the definition varies by task (punctuation marks, numbers, or subword pieces may all count as tokens, depending on the tokenizer). The vocab is the number of unique "words". So in the example above, the corpus contained 840 billion tokens in total, of which 2.2 million were distinct.
It should always be the case that vocab <= tokens.
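Here is a minimal Python sketch of the distinction, assuming simple whitespace tokenization on a toy corpus (the real GloVe corpora were tokenized differently, so this is only illustrative):

```python
from collections import Counter

# Toy corpus; whitespace splitting stands in for a real tokenizer.
corpus = "the cat sat on the mat the cat slept"

tokens = corpus.split()   # every occurrence counts as a token
vocab = set(tokens)       # only unique words count toward the vocab

print("tokens:", len(tokens))   # 9  -> total number of tokens
print("vocab:", len(vocab))     # 6  -> vocabulary size
print(Counter(tokens).most_common(3))  # e.g. [('the', 3), ('cat', 2), ...]
```

The pretrained GloVe file itself contains one vector per vocab entry (2.2M rows in the example above), while the token count only describes how much text was used to train those vectors.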