I have a corpus of documents and I want to represent each document as a vector. The vector would have a 1 for every word that occurs in the document and a 0 for every word that occurs elsewhere in the corpus but not in that document. How do I create this vector for all the documents in Weka?
Is there a quick way to do this using Weka? I also want Weka to remove stopwords and do some pre-processing, if possible, before it creates this vector.
Thanks, Abhishek S
You want the StringToWordVector filter.
It has options for binary occurrence (1/0 instead of word counts) and stopword removal, among many others, such as stemming, truncating the word list, discarding infrequent terms, and case folding.
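To make the target representation concrete, here is a minimal plain-Java sketch of a binary occurrence vector with case folding and stopword removal. It does not call Weka and is not how StringToWordVector is implemented; the class name, the tiny stopword list, and the letters-only tokenizer are all assumptions made for illustration. In Weka the filter builds the equivalent attributes for you.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: a hand-rolled binary bag-of-words, not Weka code.
class BinaryBagOfWords {
    // Hypothetical, deliberately tiny stopword list.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "on", "and", "of"));

    // Case-fold, split on anything that is not a-z, drop empties and stopwords.
    static List<String> tokenize(String doc) {
        List<String> tokens = new ArrayList<>();
        for (String t : doc.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty() && !STOPWORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // The vocabulary is every remaining word in the corpus, sorted alphabetically.
    static List<String> vocabulary(List<String> corpus) {
        TreeSet<String> vocab = new TreeSet<>();
        for (String doc : corpus) {
            vocab.addAll(tokenize(doc));
        }
        return new ArrayList<>(vocab);
    }

    // One slot per vocabulary word: 1 if the document contains it, else 0.
    static int[] vectorize(String doc, List<String> vocab) {
        Set<String> present = new HashSet<>(tokenize(doc));
        int[] vector = new int[vocab.size()];
        for (int i = 0; i < vocab.size(); i++) {
            vector[i] = present.contains(vocab.get(i)) ? 1 : 0;
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList("The cat sat on the mat", "A dog and a cat");
        List<String> vocab = vocabulary(corpus);
        System.out.println(vocab);
        for (String doc : corpus) {
            System.out.println(Arrays.toString(vectorize(doc, vocab)));
        }
    }
}
```

For the two sample documents the vocabulary is [cat, dog, mat, sat], and the vectors are [1, 0, 1, 1] and [1, 1, 0, 0].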