I'm trying to take a set of reviews, and convert them into the ARFF format for use with WEKA. Unfortunately either I completely misunderstand how the format works, or I'll have to have an attribute for ALL possible words, then a presence indicator. Does anyone know a better way, or ideally have a sample ARFF file?
ARFF stands for Attribute-Relation File Format. It is an ASCII text file that describes a list of instances sharing a set of attributes.
ARFF files have two distinct sections. The first section is the Header information, which is followed by the Data information. Lines that begin with a % are comments.
On the right-hand side of the Preprocess tab is a "Save..." button. You can click on that and save your data as a .arff file. This is a bit long-winded to explain, but takes only a few moments to perform and is very intuitive.
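To make the two sections concrete, here is a minimal Python sketch that writes a header and a data section for a handful of labelled reviews. The function names, the `review`/`sentiment` attribute names, and the backslash escaping inside double quotes are assumptions made for illustration, not something Weka ships; check the escaping against your Weka version if your reviews contain quotes or newlines.

```python
def quote_arff_string(text):
    """Wrap text in double quotes, backslash-escaping characters that would break the line."""
    escaped = (
        text.replace("\\", "\\\\")  # escape backslashes first so later escapes survive
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
        .replace("\t", "\\t")
    )
    return '"' + escaped + '"'


def reviews_to_arff(reviews, relation="text_files"):
    """Build ARFF text from (review_text, label) pairs, where label is 0 or 1."""
    lines = [
        "% Header section: relation name and attribute declarations",
        f"@relation {relation}",
        "@attribute review string",
        "@attribute sentiment {0, 1}",
        "% Data section: one instance per line",
        "@data",
    ]
    for text, label in reviews:
        lines.append(f"{quote_arff_string(text)}, {label}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [("this is some text", 1), ("different stuff", 0)]
    print(reviews_to_arff(sample), end="")
```

Note that the review is declared as a single `string` attribute, so there is no need to list every possible word by hand at this stage.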
The .arff extension denotes the Attribute-Relation File Format. Such a file stores data describing a list of instances that share a set of attributes, and the format is categorized as a developer file type.
If you store the reviews as plain text files in separate folders, one per class (positive and negative in your case), you can use TextDirectoryLoader.
You can find it in Weka's KnowledgeFlow application or run it from the command line. More info here: http://weka.wikispaces.com/ARFF+files+from+Text+Collections
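If it helps to see what a one-folder-per-class layout turns into, the following Python sketch does a similar conversion by hand. It is a stand-in written for illustration, not TextDirectoryLoader itself: the function names, the `*.txt` filter, and the `review`/`sentiment` attribute names are all assumptions, and the folder names are assumed to be plain words that need no quoting.

```python
from pathlib import Path


def quote(text):
    # Backslash-escape characters that would otherwise end the quoted string or the line.
    for raw, escaped in (("\\", "\\\\"), ('"', '\\"'), ("\n", "\\n"), ("\r", "\\r"), ("\t", "\\t")):
        text = text.replace(raw, escaped)
    return f'"{text}"'


def directory_to_arff(root):
    """Treat each sub-folder of root as a class and each .txt file in it as one instance."""
    root = Path(root)
    labels = sorted(p.name for p in root.iterdir() if p.is_dir())
    lines = [
        f"@relation {root.name}",
        "@attribute review string",
        "@attribute sentiment {" + ",".join(labels) + "}",
        "@data",
    ]
    for label in labels:
        for path in sorted((root / label).glob("*.txt")):
            text = path.read_text(encoding="utf-8").strip()
            lines.append(f"{quote(text)},{label}")
    return "\n".join(lines) + "\n"
```

Calling `directory_to_arff("reviews")` on a folder containing `positive/` and `negative/` sub-folders returns the ARFF text as a string, which you can then write to a file.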
Took a while to work out, but with this input.arff:
@relation text_files
@attribute review string
@attribute sentiment {0, 1}
@data
"this is some text", 1
"this is some more text", 1
"different stuff", 0
And this command:
java -classpath "C:\\Program Files\\Weka-3-6\\weka.jar" weka.filters.unsupervised.attribute.StringToWordVector -i input.arff -o output.arff
The following is produced:
@relation 'text_files-weka.filters.unsupervised.attribute.StringToWordVector-R1-W1000-prune-rate-1.0-N0-stemmerweka.core.stemmers.NullStemmer-M1-tokenizerweka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\"'
@attribute sentiment {0,1}
@attribute different numeric
@attribute is numeric
@attribute more numeric
@attribute some numeric
@attribute stuff numeric
@attribute text numeric
@attribute this numeric
@data
{0 1,2 1,4 1,6 1,7 1}
{0 1,2 1,3 1,4 1,6 1,7 1}
{1 1,5 1}
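The data rows above are in sparse form: each `index value` pair names an attribute by its zero-based position in the header, and any attribute not listed is 0 (for the nominal `sentiment` attribute, that means its first value, which here is also `0`). As a rough way to read them back, this Python sketch expands a row and lists the words it contains. The helper names are made up for this example, and the parser only handles simple unquoted values like the ones shown.

```python
# Attribute order copied from the header of output.arff above.
ATTRIBUTES = ["sentiment", "different", "is", "more", "some", "stuff", "text", "this"]


def parse_sparse_row(row, n_attributes):
    """Expand one sparse ARFF row such as '{0 1,2 1}' into a dense list of value strings."""
    dense = ["0"] * n_attributes  # attributes not listed in the row default to 0
    body = row.strip().strip("{}").strip()
    if body:
        for pair in body.split(","):
            index, value = pair.split()
            dense[int(index)] = value
    return dense


def words_in_row(row):
    """Return the word attributes (everything after sentiment) that are non-zero in the row."""
    dense = parse_sparse_row(row, len(ATTRIBUTES))
    return [name for name, value in zip(ATTRIBUTES[1:], dense[1:]) if value != "0"]


if __name__ == "__main__":
    for row in ["{0 1,2 1,4 1,6 1,7 1}", "{0 1,2 1,3 1,4 1,6 1,7 1}", "{1 1,5 1}"]:
        sentiment = parse_sparse_row(row, len(ATTRIBUTES))[0]
        print(sentiment, words_in_row(row))
```

So the answer to the original worry is that the word attributes are generated by the filter, and the sparse rows only mention the words that actually occur in each review.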