Vocabulary Processor function

I am researching input embeddings for convolutional neural networks, and I understand Word2vec. However, in his CNN text classification code, dennybritz uses the function learn.preprocessing.VocabularyProcessor. The documentation says it "maps documents to sequences of word ids". I am not quite sure how this function works. Does it create a list of ids and then map the ids to words, or does it hold a dictionary of words and their ids and return only the ids when it is run?

ngoduyvu asked Oct 03 '16 05:10



1 Answer

Let's say that you have just two documents, "I like pizza" and "I like pasta". Your whole vocabulary consists of the words (I, like, pizza, pasta). Every word in the vocabulary has an associated index, like so: (1, 2, 3, 4). Now a document such as "I like pasta" can be converted into the vector [1, 2, 4]. This is what learn.preprocessing.VocabularyProcessor does: it builds the dictionary of words and their ids when it is fitted on your documents, and it returns only the ids when it transforms them. The parameter max_document_length makes sure that every document is represented by a vector of length max_document_length, either by padding it with zeros if it is shorter or by clipping it if it is longer. Hope this helps.
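To make the mapping concrete, here is a minimal pure-Python sketch of the behaviour described above. It is not the TensorFlow implementation: the class name SimpleVocabProcessor, its fit and transform methods, the whitespace tokenization and the lowercasing are all assumptions made for illustration. Index 0 is assumed to stand for both padding and words that were not seen during fitting.

```python
class SimpleVocabProcessor:
    """Toy stand-in for the behaviour described above (hypothetical, not the TensorFlow class)."""

    def __init__(self, max_document_length):
        self.max_document_length = max_document_length
        # Ids start at 1; 0 is reserved here for padding and unknown words (assumption).
        self.word_to_id = {}

    def fit(self, documents):
        # Assign the next free id to every word not seen before.
        for doc in documents:
            for word in doc.lower().split():
                if word not in self.word_to_id:
                    self.word_to_id[word] = len(self.word_to_id) + 1
        return self

    def transform(self, documents):
        rows = []
        for doc in documents:
            # Look up each word; unseen words map to 0.
            ids = [self.word_to_id.get(word, 0) for word in doc.lower().split()]
            # Clip documents that are too long, then pad short ones with zeros.
            ids = ids[:self.max_document_length]
            ids += [0] * (self.max_document_length - len(ids))
            rows.append(ids)
        return rows


vocab = SimpleVocabProcessor(max_document_length=4).fit(["I like pizza", "I like pasta"])
encoded = vocab.transform(["I like pasta", "I really like pizza and pasta"])
print(vocab.word_to_id)  # {'i': 1, 'like': 2, 'pizza': 3, 'pasta': 4}
print(encoded)           # [[1, 2, 4, 0], [1, 0, 2, 3]]
```

The first document is shorter than max_document_length and gets one trailing 0; the second is clipped to four ids, and "really" becomes 0 because it was not in the fitted vocabulary.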

Kashyap answered Nov 07 '22 11:11
