
How to create a bag of words using Weka?

Tags: nlp, weka

I have a corpus of documents and I want to represent each document as a vector. The vector would have a 1 for each word that appears in the document and a 0 for every other word in the vocabulary (words that appear in other documents in the corpus but not in this one). How do I create this vector for all the documents in Weka?

Is there a quick way to do this using Weka? I also want Weka to remove stopwords and do some pre-processing, if possible, before it creates this vector.

Thanks, Abhishek S

London guy asked Oct 10 '11 07:10


1 Answer

You want the StringToWordVector filter.

It has options for binary occurrence (word presence rather than counts) and stopword removal, along with many others, such as stemming, truncating the word list, discarding infrequent terms, and case folding.
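To make the target representation concrete, here is a minimal plain-Java sketch of a binary bag of words with stopword removal and case folding. It does not use Weka and is not how StringToWordVector is implemented; the class name `BagOfWordsSketch`, the helper methods, the tiny stopword set, and the two sample documents are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only: not part of Weka.
class BagOfWordsSketch {

    // Case-fold the text, split on anything that is not a-z, drop stopwords.
    static List<String> tokenize(String text, Set<String> stopwords) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty() && !stopwords.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Alphabetically sorted vocabulary over the whole corpus.
    static List<String> vocabulary(List<String> docs, Set<String> stopwords) {
        Set<String> vocab = new TreeSet<>();
        for (String doc : docs) {
            vocab.addAll(tokenize(doc, stopwords));
        }
        return new ArrayList<>(vocab);
    }

    // 1 if the vocabulary word occurs in the document, otherwise 0.
    static int[] toBinaryVector(String doc, List<String> vocab, Set<String> stopwords) {
        Set<String> present = new HashSet<>(tokenize(doc, stopwords));
        int[] vector = new int[vocab.size()];
        for (int i = 0; i < vocab.size(); i++) {
            vector[i] = present.contains(vocab.get(i)) ? 1 : 0;
        }
        return vector;
    }

    public static void main(String[] args) {
        // Sample corpus and stopword list are assumptions for the demo.
        List<String> docs = Arrays.asList(
                "The cat sat on the mat",
                "The dog chased the cat");
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "on"));

        List<String> vocab = vocabulary(docs, stopwords);
        System.out.println(vocab); // [cat, chased, dog, mat, sat]
        for (String doc : docs) {
            System.out.println(Arrays.toString(toBinaryVector(doc, vocab, stopwords)));
        }
        // [1, 0, 0, 1, 1]
        // [1, 1, 1, 0, 0]
    }
}
```

Each position in a vector corresponds to one vocabulary word, so every document gets a vector of the same length. In Weka the filter produces the equivalent as one attribute per word.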

michaeltwofish answered Sep 29 '22 06:09