I'm trying to do document classification using the Weka Java API.
Here is the directory structure of my data files.
+- text_example
   |
   +- class1
   |  |
   |  3 html files
   |
   +- class2
   |  |
   |  1 html file
   |
   +- class3
      |
      3 html files
I created the ARFF file with TextDirectoryLoader. Then I apply the StringToWordVector filter to that ARFF file, with filter.setOutputWordCounts(true).
Below is a sample of the output once the filter is applied. I need to get a few things clarified.
@attribute </form> numeric
@attribute </h1> numeric
.
.
@attribute earth numeric
@attribute easy numeric
This huge list should be the tokenization of the content of the initial html files, right?
Then I have,
@data
{1 2,3 2,4 1,11 1,12 7,..............}
{10 4,34 1,37 5,.......}
{2 1,5 6,6 16,...}
{0 class2,34 11,40 15,.....,4900 3,...
{0 class3,1 2,37 3,40 5....
{0 class3,1 2,31 20,32 17......
{0 class3,32 5,42 1,43 10.........
Why is there no class attribute for the first 3 items? (They should have class1.) What does the leading 0 mean, as in {0 class2,..} and {0 class3,..}?
The data says, for instance, that in the 3rd html file in the class3 folder, the word identified by the integer 32 appears 5 times. How do I get the word (token) referred to by 32?
How do I reduce the dimensionality of the feature vector? Don't we need to make all the feature vectors the same size? For example, consider only the 100 most frequent terms from the training set and, when it comes to testing, count only the occurrences of those 100 terms in the test documents. In that case, what happens if a totally new word turns up in the testing phase: will the classifier just ignore it?
Am I missing something here? I'm new to Weka.
I would also really appreciate it if someone could explain how the classifier uses the vector created with the StringToWordVector filter. (Creating the vocabulary from the training data, dimensionality reduction: do those happen inside the Weka code?)
The @attribute list contains all the tokens derived from your input.
The @data section is in the sparse format, that is, for each attribute the value is only stated if it is different from zero. For the first three lines, the class attribute is class1; you just can't see it (if it were unknown, you would see 0 ? at the beginning of the first three lines). Why is that so? Weka internally represents nominal attributes (that includes classes) as doubles and starts counting at zero. So your three classes are internally class1=0.0, class2=1.0, class3=2.0. As zero values are not stated in the sparse format, you can't see the class in the first three lines. (Also see the section "Sparse ARFF files" on http://www.cs.waikato.ac.nz/ml/weka/arff.html)
To get the word that an index refers to, take your Instances object and invoke attribute(n).name() on it. For that, n starts counting at 0.
To reduce the dimensionality, use stringToWordVector.setWordsToKeep(100). Note that this will try to keep 100 words of every class. If you do not want to keep 100 words per class, call stringToWordVector.setDoNotOperateOnPerClassBasis(true). You will get slightly more than 100 if there are several words with the same frequency, so the 100 is just a kind of target value.
As for new words in the testing phase, I think you have to pass all instances through StringToWordVector before classifying. I am not 100% sure on that one though, as I am using a two-class setup and I let StringToWordVector transform all my instances before telling the classifier anything about it.
I can generally recommend experimenting with the Weka KnowledgeFlow tool to learn how to use the different classes. If you know how to do things there, you can use that knowledge for your Java code quite easily. Hope I was able to help you, although the answer is a bit late.
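To make the sparse-format point concrete, here is a small plain-Java sketch (it does not use the Weka API) that expands a sparse row into a dense one and looks up a token by its index. The class name SparseRowDemo, the five-attribute list, and the two rows are made up for illustration; only the rule "an omitted value means zero, and zero means the first class label" comes from the explanation above.

```java
import java.util.Arrays;
import java.util.List;

class SparseRowDemo {
    /**
     * Expands one sparse ARFF row such as "{0 class2,4 11}" into a dense row.
     * Attribute 0 is assumed to be the nominal class attribute.
     */
    static String[] expand(String row, int numAttributes, List<String> classLabels) {
        String[] dense = new String[numAttributes];
        Arrays.fill(dense, "0"); // every omitted numeric value is zero
        // An omitted class value is internal index 0, i.e. the first label.
        dense[0] = classLabels.get(0);

        String body = row.trim();
        body = body.substring(1, body.length() - 1).trim(); // strip the braces
        if (body.isEmpty()) {
            return dense;
        }
        for (String pair : body.split(",")) {
            String[] parts = pair.trim().split("\\s+", 2); // "<index> <value>"
            dense[Integer.parseInt(parts[0])] = parts[1];
        }
        return dense;
    }

    public static void main(String[] args) {
        // Toy attribute list; index 0 is the class, the rest are tokens.
        List<String> names = Arrays.asList("class", "</form>", "</h1>", "earth", "easy");
        List<String> labels = Arrays.asList("class1", "class2", "class3");

        // No "0 ..." entry, so the class is the first label.
        System.out.println(String.join(" ", expand("{1 2,3 2,4 1}", names.size(), labels)));
        // prints: class1 2 0 2 1
        System.out.println(String.join(" ", expand("{0 class2,4 11}", names.size(), labels)));
        // prints: class2 0 0 0 11

        // Index-to-token lookup, the plain-Java analogue of attribute(n).name().
        System.out.println(names.get(4)); // prints: easy
    }
}
```

With a real Instances object you would not parse anything by hand; the lookup is just attribute(n).name() as described above.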
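On the question of fixed-size vectors and unseen words, the idea can be sketched in plain Java without Weka. This is only a toy model of "build the vocabulary from the training data, then count only those words", not Weka's implementation: the class FixedVocabularyDemo and its methods are hypothetical, it cuts off at exactly maxWords instead of keeping ties, and it does not work per class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FixedVocabularyDemo {
    /** Picks the maxWords most frequent tokens across the training documents. */
    static List<String> buildVocabulary(List<String> trainingDocs, int maxWords) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : trainingDocs) {
            for (String token : tokenize(doc)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        List<String> words = new ArrayList<>(counts.keySet());
        // Most frequent first; ties broken alphabetically to stay deterministic.
        words.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(words.subList(0, Math.min(maxWords, words.size())));
    }

    /** Counts each vocabulary word in doc; tokens outside the vocabulary are skipped. */
    static int[] vectorize(String doc, List<String> vocabulary) {
        int[] vector = new int[vocabulary.size()];
        for (String token : tokenize(doc)) {
            int index = vocabulary.indexOf(token);
            if (index >= 0) {
                vector[index]++;
            }
        }
        return vector;
    }

    /** Lowercases and splits on whitespace; a stand-in for a real tokenizer. */
    static String[] tokenize(String text) {
        String trimmed = text.trim().toLowerCase();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        List<String> training = Arrays.asList("earth is easy", "earth earth moon", "moon is far");
        List<String> vocabulary = buildVocabulary(training, 3);
        System.out.println(vocabulary); // prints: [earth, is, moon]

        // "mars" never occurred in training, so it has no slot and is dropped.
        System.out.println(Arrays.toString(vectorize("earth mars earth is", vocabulary)));
        // prints: [2, 1, 0]
    }
}
```

Every document, training or test, ends up with a vector of the same length because the vocabulary is fixed once, and a word that was never seen in training simply has no position to be counted in.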