I am researching embedding inputs for convolutional neural networks, and I understand Word2vec. However, in his CNN text classification code, dennybritz uses the function `learn.preprocessing.VocabularyProcessor`. The documentation says it "maps documents to sequences of word ids". I am not quite sure how this function works. Does it create a list of ids and then map the ids to words, or does it hold a dictionary of words and their ids and only return the ids when the function is run?
Let's say you have just two documents, `I like pizza` and `I like pasta`. Your whole vocabulary consists of these words: (I, like, pizza, pasta). Every word in the vocabulary has an index associated with it, like so: (1, 2, 3, 4). Now a document like `I like pasta` can be converted into the vector [1, 2, 4]. This is what `learn.preprocessing.VocabularyProcessor` does. The parameter `max_document_length` makes sure that every document is represented by a vector of length `max_document_length`, either by padding it with zeros if it is shorter than `max_document_length` or by clipping it if it is longer.
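To make the mapping concrete, here is a minimal pure-Python sketch of the same idea. It is not the library's implementation: the class name `SimpleVocabularyProcessor` and its `fit`/`transform` methods are made up for illustration, it splits on whitespace only, and it assumes id 0 is reserved for padding and for words that were never seen during fitting.

```python
class SimpleVocabularyProcessor:
    """Toy stand-in for VocabularyProcessor (hypothetical, for illustration)."""

    def __init__(self, max_document_length):
        self.max_document_length = max_document_length
        # word -> id; id 0 is assumed reserved for padding and unknown words
        self.word_to_id = {}

    def fit(self, documents):
        # Assign the next free id (starting at 1) to each new word.
        for doc in documents:
            for word in doc.split():
                if word not in self.word_to_id:
                    self.word_to_id[word] = len(self.word_to_id) + 1
        return self

    def transform(self, documents):
        rows = []
        for doc in documents:
            # Look up each word; unknown words fall back to id 0.
            ids = [self.word_to_id.get(word, 0) for word in doc.split()]
            # Clip documents that are too long ...
            ids = ids[: self.max_document_length]
            # ... and pad documents that are too short with zeros.
            ids += [0] * (self.max_document_length - len(ids))
            rows.append(ids)
        return rows


processor = SimpleVocabularyProcessor(max_document_length=4)
processor.fit(["I like pizza", "I like pasta"])

encoded = processor.transform(["I like pasta", "I like pizza and pasta too"])
print(processor.word_to_id)  # {'I': 1, 'like': 2, 'pizza': 3, 'pasta': 4}
print(encoded)               # [[1, 2, 4, 0], [1, 2, 3, 0]]
```

So, to answer the question directly: it builds a dictionary of words and their ids while fitting, and afterwards it only hands back the ids. The first document is padded with one zero, while the second is clipped to four ids, with the unseen word `and` mapped to 0.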
Hope this helps.