Interpreting the output of StringToWordVector() - Weka

Question

I'm trying to do document classification using the Weka Java API.

Here is the directory structure of my data files.

+- text_example
|
+- class1
|  |
|  3 html files
|
+- class2
|   |
|   1 html file
|
+- class3
    |
    3 html files

I created the ARFF file with TextDirectoryLoader. Then I apply the StringToWordVector filter to that ARFF file, with filter.setOutputWordCounts(true).

Below is a sample of the output once the filter is applied. I need to get a few things clarified.

@attribute </form> numeric
@attribute </h1> numeric
.
.
@attribute earth numeric
@attribute easy numeric

This huge list should be the tokenization of the content of the initial HTML files, right?

Then I have,

@data
{1 2,3 2,4 1,11 1,12 7,..............}
{10 4,34 1,37 5,.......}
{2 1,5 6,6 16,...}
{0 class2,34 11,40 15,.....,4900 3,...
{0 class3,1 2,37 3,40 5....
{0 class3,1 2,31 20,32 17......
{0 class3,32 5,42 1,43 10.........

Why is there no class attribute for the first 3 items? (They should have class1.) What does the leading 0 mean in {0 class2,..} and {0 class3,..}? The output says, for instance, that in the 3rd HTML file in the class3 folder, the word identified by the integer 32 appears 5 times. How do I get the word (token) referred to by 32?

How do I reduce the dimensionality of the feature vector? Don't we need to make all the feature vectors the same size? For example, consider only the 100 most frequent terms from the training set and later, when it comes to testing, count the occurrences of only those 100 terms in the test documents. In that setup, what happens if a totally new word comes up in the testing phase? Will the classifier just ignore it?

Am I missing something here? I'm new to Weka.

I would also really appreciate it if someone could explain how the classifier uses the vector created by the StringToWordVector filter. (For example, creating the vocabulary from the training data and reducing dimensionality: do those happen inside the Weka code?)

Malhelo · Accepted Answer

The huge list of @attribute lines contains all the tokens derived from your input.
Your @data section is in the sparse format, that is, a value is only stated for an attribute if it is different from zero. For the first three lines, the class attribute is class1, you just can't see it (if it were unknown, you would see a 0 ? at the beginning of the first three lines). Why is that so? Weka internally represents nominal attributes (that includes classes) as doubles and starts counting at zero. So your three classes are internally: class1=0.0, class2=1.0, class3=2.0. As zero values are not stated in the sparse format, you can't see the class in the first three lines. The leading 0 in {0 class2,..} is therefore simply the index of the class attribute. (Also see the section "Sparse ARFF files" on http://www.cs.waikato.ac.nz/ml/weka/arff.html)
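To make the "invisible class1" point concrete, here is a minimal sketch in plain Java (no Weka involved) that reads a sparse row the way described above. The class and method names (SparseRowDemo, parseSparseRow, classLabel) are made up for illustration, and the parsing is deliberately naive: it assumes unquoted values and a class attribute at index 0.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SparseRowDemo {

    /**
     * Expands one sparse row such as "{1 2,3 2}" into a map from attribute
     * index to the raw value text. Indices missing from the row are simply
     * absent from the map; the caller treats them as zero.
     */
    public static Map<Integer, String> parseSparseRow(String row) {
        Map<Integer, String> values = new LinkedHashMap<>();
        String body = row.trim();
        body = body.substring(1, body.length() - 1).trim(); // strip { and }
        if (body.isEmpty()) {
            return values;
        }
        for (String pair : body.split(",")) {
            String[] parts = pair.trim().split("\\s+", 2); // "index value"
            values.put(Integer.parseInt(parts[0]), parts[1]);
        }
        return values;
    }

    /**
     * Resolves the class label of a row whose class attribute sits at index 0.
     * An omitted entry means internal value 0, i.e. the first declared label.
     */
    public static String classLabel(String row, List<String> labels) {
        String stored = parseSparseRow(row).get(0);
        return stored == null ? labels.get(0) : stored;
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("class1", "class2", "class3");
        System.out.println(classLabel("{1 2,3 2,4 1}", labels));          // prints class1
        System.out.println(classLabel("{0 class2,34 11,40 15}", labels)); // prints class2
    }
}
```

A row with an unknown class would carry an explicit "0 ?" entry, so this sketch would return "?" for it rather than falling back to the first label.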
To get the word/token represented by index n, you can either count through the @attribute list or, if you have the Instances object, invoke attribute(n).name() on it. In both cases, n is counted from 0.
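As a rough sketch of that lookup, assuming a Weka 3.8-style API on the classpath: the tiny header built here is a stand-in for the filtered data from the question, and the names AttributeNameDemo, tokenAt, exampleHeader and the attribute name "label" are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;

import weka.core.Attribute;
import weka.core.Instances;

public class AttributeNameDemo {

    /** Returns the name of the attribute at the given 0-based index. */
    public static String tokenAt(Instances data, int index) {
        return data.attribute(index).name();
    }

    /** Builds a stand-in header: a nominal class followed by two word attributes. */
    public static Instances exampleHeader() {
        ArrayList<Attribute> attrs = new ArrayList<>();
        attrs.add(new Attribute("label", Arrays.asList("class1", "class2", "class3")));
        attrs.add(new Attribute("earth")); // numeric attribute, index 1
        attrs.add(new Attribute("easy"));  // numeric attribute, index 2
        return new Instances("example", attrs, 0);
    }

    public static void main(String[] args) {
        Instances data = exampleHeader();
        // Print the index-to-token mapping for every attribute.
        for (int i = 0; i < data.numAttributes(); i++) {
            System.out.println(i + " -> " + tokenAt(data, i));
        }
    }
}
```

With the real data, the same call with index 32 on the filtered Instances object should give the token behind "32 5".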
To reduce the dimensionality of the feature vector, there are a lot of options. If you only want the 100 most frequent terms, call stringToWordVector.setWordsToKeep(100). Note that this will try to keep 100 words for every class. If you do not want to keep 100 words per class, call stringToWordVector.setDoNotOperateOnPerClassBasis(true). You may get slightly more than 100 if several words have the same frequency, so the 100 is just a kind of target value.
As for new words occurring in the test phase, I think that cannot happen because you have to hand the stringToWordVector all instances before classifying. I am not 100% sure on that one though, as I am using a two-class setup and I let StringToWordVector transform all my instances before telling the classifier anything about it. If the filter is instead initialised on the training set only and then reused on the test set, a word that never appeared in training has no attribute in the dictionary and is dropped.
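One way to check the unseen-word behaviour yourself is Weka's batch filtering pattern: initialise the filter on the training data, then run the same filter object over the test data. The sketch below assumes a Weka 3.8-style API on the classpath and uses tiny made-up in-memory documents instead of the HTML files; BatchFilterDemo, makeData and makeFilter are illustrative names, not part of Weka.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BatchFilterDemo {

    /** Builds a dataset (string attribute + nominal class) from {text, label} pairs. */
    public static Instances makeData(String[][] docs) {
        ArrayList<Attribute> attrs = new ArrayList<>();
        attrs.add(new Attribute("text", (List<String>) null)); // null list = string attribute
        attrs.add(new Attribute("label", Arrays.asList("class1", "class2")));
        Instances data = new Instances("docs", attrs, docs.length);
        data.setClassIndex(1);
        for (String[] doc : docs) {
            double[] vals = new double[2];
            vals[0] = data.attribute(0).addStringValue(doc[0]);
            vals[1] = data.attribute(1).indexOfValue(doc[1]);
            data.add(new DenseInstance(1.0, vals));
        }
        return data;
    }

    /** Configures the filter with the options from the answer and binds it to the training header. */
    public static StringToWordVector makeFilter(Instances train) throws Exception {
        StringToWordVector filter = new StringToWordVector();
        filter.setOutputWordCounts(true);
        filter.setWordsToKeep(100);                  // target size, not a hard limit
        filter.setDoNotOperateOnPerClassBasis(true); // one shared top-100 list
        filter.setInputFormat(train);                // call after setting the options
        return filter;
    }

    public static void main(String[] args) throws Exception {
        Instances train = makeData(new String[][] {
            {"earth is round", "class1"}, {"easy task", "class2"}});
        Instances test = makeData(new String[][] {
            {"earth unseen", "class1"}});

        StringToWordVector filter = makeFilter(train);
        Instances trainVec = Filter.useFilter(train, filter); // first batch builds the dictionary
        Instances testVec = Filter.useFilter(test, filter);   // later batch reuses it

        System.out.println("train attributes: " + trainVec.numAttributes());
        System.out.println("test attributes:  " + testVec.numAttributes());
        System.out.println("'unseen' kept:    " + (testVec.attribute("unseen") != null));
    }
}
```

Both outputs share the same attribute list, which is what keeps every feature vector the same size for the classifier.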

In general, I recommend experimenting with the Weka KnowledgeFlow tool to learn how to use the different classes. If you know how to do things there, you can use that knowledge for your Java code quite easily. Hope I was able to help you, although the answer is a bit late.

Tags:

java

text

machine-learning

classification

weka

samsamara

1 Answers

Malhelo
