How to create a bag of words using Weka?

Question

I have a corpus of documents and I want to represent each document as a vector. Basically, the vector would have 1 for words that are present inside a document and for other words (which are present in other documents in the corpus and not in this particular document) it would have a 0. How do I create this vector for all the documents in Weka?

Is there a quick way to do this using Weka? I also want Weka to remove stopwords and do some pre-processing, if possible, before it creates this vector.

Thanks Abhishek S

michaeltwofish · Accepted Answer

You want the StringToWordVector filter.

It has options for binary occurrence (1/0 presence instead of word counts) and stopword removal, among many others, such as stemming, truncating the word list, discarding infrequent terms, and case folding.
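For what it's worth, here is a minimal sketch of how this might look from Java code. It assumes a Weka 3.8-style API on the classpath, where stopword removal is configured through `setStopwordsHandler`; as far as I know, older 3.6 releases used `setUseStoplist(true)` instead, so check your version. The class name `BagOfWordsSketch`, the helper `toDataset`, and the two sample sentences are made up for illustration; in practice you would load your own corpus, for example from an ARFF file with a string attribute.

```java
import java.util.ArrayList;
import java.util.List;

import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.stemmers.LovinsStemmer;
import weka.core.stopwords.Rainbow;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BagOfWordsSketch {

    // Hypothetical helper: wraps raw strings in a dataset with one string attribute.
    static Instances toDataset(String[] docs) {
        ArrayList<Attribute> attrs = new ArrayList<>();
        // Passing a null value list makes this a string attribute.
        attrs.add(new Attribute("text", (List<String>) null));
        Instances data = new Instances("corpus", attrs, docs.length);
        for (String doc : docs) {
            double[] vals = new double[1];
            vals[0] = data.attribute(0).addStringValue(doc);
            data.add(new DenseInstance(1.0, vals));
        }
        return data;
    }

    // Converts the string attribute into one 0/1 attribute per retained word.
    static Instances binaryBagOfWords(Instances raw) throws Exception {
        StringToWordVector filter = new StringToWordVector();
        filter.setOutputWordCounts(false);          // presence/absence, not counts
        filter.setLowerCaseTokens(true);            // case folding
        filter.setStopwordsHandler(new Rainbow());  // assumes Weka 3.8-style stopword API
        filter.setStemmer(new LovinsStemmer());     // optional stemming
        filter.setWordsToKeep(1000);                // truncate the word list
        filter.setMinTermFreq(1);                   // raise to discard infrequent terms
        filter.setInputFormat(raw);                 // must be called after setting options
        return Filter.useFilter(raw, filter);
    }

    public static void main(String[] args) throws Exception {
        Instances raw = toDataset(new String[] {
            "The cat sat on the mat",
            "Dogs chase cats"
        });
        // Prints the filtered dataset in ARFF form (sparse instances).
        System.out.println(binaryBagOfWords(raw));
    }
}
```

The options need to be set before calling `setInputFormat`, since that is when the filter fixes its configuration for the batch. The same filter is also available in the Explorer GUI under the unsupervised attribute filters, if you would rather not write code.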

Tags: nlp, weka
