I'm now trying to build a J48 (C4.5) classifier model on my training data using Weka.
First I do this, which seems to go OK:
java -Xmx10G -cp /weka/weka.jar weka.core.converters.TextDirectoryLoader -dir /home/test/cats > /home/test/cats.arff
This seems to go OK too:
java -Xmx10G -cp /weka/weka.jar weka.filters.unsupervised.attribute.StringToWordVector -i /home/test/cats.arff -o /home/test/cats-vector.arff
This does not go OK:
java -Xmx10G -cp /weka/weka.jar weka.classifiers.trees.J48 -t /home/test/cats-vector.arff -d /home/test/cats.model
It gives the following error:
weka.core.UnsupportedAttributeTypeException: weka.classifiers.trees.j48.C45Prune ableClassifierTree: Cannot handle numeric class!
at weka.core.Capabilities.test(Capabilities.java:954)
at weka.core.Capabilities.test(Capabilities.java:1110)
at weka.core.Capabilities.test(Capabilities.java:1023)
at weka.core.Capabilities.testWithFail(Capabilities.java:1302)
at weka.classifiers.trees.j48.C45PruneableClassifierTree.buildClassifier (C45PruneableClassifierTree.java:116)
at weka.classifiers.trees.J48.buildClassifier(J48.java:236)
at weka.classifiers.Evaluation.evaluateModel(Evaluation.java:1076)
at weka.classifiers.Classifier.runClassifier(Classifier.java:312)
at weka.classifiers.trees.J48.main(J48.java:948)
So I then tried this:
java -Xmx10G -cp /weka/weka.jar weka.classifiers.trees.J48 -t /home/test/cats.arff -d /home/test/cats.model
Which also gives the error:
weka.core.UnsupportedAttributeTypeException: weka.classifiers.trees.j48.C45PruneableClassifierTree: Cannot handle string attributes!
at weka.core.Capabilities.test(Capabilities.java:980)
at weka.core.Capabilities.test(Capabilities.java:869)
at weka.core.Capabilities.test(Capabilities.java:1085)
at weka.core.Capabilities.test(Capabilities.java:1023)
at weka.core.Capabilities.testWithFail(Capabilities.java:1302)
at weka.classifiers.trees.j48.C45PruneableClassifierTree.buildClassifier(C45PruneableClassifierTree.java:116)
at weka.classifiers.trees.J48.buildClassifier(J48.java:236)
at weka.classifiers.Evaluation.evaluateModel(Evaluation.java:1076)
at weka.classifiers.Classifier.runClassifier(Classifier.java:312)
at weka.classifiers.trees.J48.main(J48.java:948)
Obviously I've prepared the data wrong somehow (BTW the input is text files in subdirectories which are named by the categories that I want). But I thought I was following the instructions on the Weka Wiki: Weka Wiki Categorizing Text Files Weka Wiki Primer
So what am I doing wrong? I would like to use J48 because it's given high accuracy on my data in tests. So what do I do to my data to get the J48 classifier to accept it? Or do I need to use a different classifier?
Please help!
The word vectors could be converted to binary like this:
java -Xmx4G -cp /weka/weka.jar weka.filters.unsupervised.attribute.NumericToBinary -i /home/test/cats-vector.arff -o /home/test/cats-binary.arff
Although this adds bias to the kind of data you are training against. This implies that binary strings very close to one-another are treated as more similar to strings far away. If you want to erase this bias and regard each string as a totally unique entity then use @attribute class {ABC, DEF, GHI, etc} Then it works!
If you really want to communicate that these features are important and not-at-all related, make a whole column for each string, where it has the value '1' for when a row has that category, and 0 when it does not. This creates very sparse data, but then the learning algorithm has a bias to scan that data for information gain.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With