
What do tokens and vocab mean in GloVe embeddings?

Tags:

nlp

embedding

I am using GloVe embeddings and I am quite confused about tokens and vocab in the embeddings. Like this one:

Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)

What do tokens and vocab mean, respectively? What is the difference?

Zhao asked Sep 06 '16 14:09


1 Answer

In NLP, tokens refers to the total number of "words" in your corpus. I put words in quotes because the definition varies by task. The vocab is the number of unique "words".

It should be the case that vocab <= tokens.
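A minimal sketch of the distinction, using a hypothetical toy corpus and simple whitespace tokenization (real pipelines use more careful tokenizers):

```python
# Toy corpus for illustration only.
corpus = "the cat sat on the mat"

# Tokens: every word occurrence in the corpus.
tokens = corpus.split()
print(len(tokens))  # 6 tokens

# Vocab: the set of unique words.
vocab = set(tokens)
print(len(vocab))   # 5 unique words ("the" appears twice)
```

So the 840B-token Common Crawl corpus contains only 2.2M distinct word types, which is why vocab <= tokens always holds.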

aberger answered Oct 28 '22 13:10