
ARFF for natural language processing

I'm trying to take a set of reviews and convert them into the ARFF format for use with WEKA. Unfortunately, either I completely misunderstand how the format works, or I'll need an attribute for every possible word, each with a presence indicator. Does anyone know a better way, or ideally have a sample ARFF file?

Dean Barnes asked May 28 '11 14:05




2 Answers

If you store the reviews as plain text files in separate folders (positive and negative in your case), you can use TextDirectoryLoader.

You can find it in Weka's KnowledgeFlow application, or run it from the command line. More info here: http://weka.wikispaces.com/ARFF+files+from+Text+Collections
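If your reviews are not already split into folders, a small script can lay them out that way. This is only a sketch: the function name `write_review_folders`, the folder names and the sample reviews are made up for illustration, and it assumes TextDirectoryLoader treats each subfolder name as a class label with one document per file.

```python
import tempfile
from pathlib import Path


def write_review_folders(reviews, root):
    """Write each (text, label) pair to <root>/<label>/<n>.txt."""
    root = Path(root)
    counts = {}  # number of files written so far for each label
    for text, label in reviews:
        class_dir = root / label
        class_dir.mkdir(parents=True, exist_ok=True)
        index = counts.get(label, 0)
        (class_dir / f"{index}.txt").write_text(text, encoding="utf-8")
        counts[label] = index + 1
    return root


# Hypothetical sample data; replace with your own reviews.
sample = [
    ("this is some text", "positive"),
    ("this is some more text", "positive"),
    ("different stuff", "negative"),
]

# A temporary directory keeps the example self-contained.
root = write_review_folders(sample, Path(tempfile.mkdtemp()) / "reviews")
for path in sorted(root.rglob("*.txt")):
    print(path.relative_to(root).as_posix())
```

As far as I know, the loader can then be pointed at the top-level folder with something like `java -cp weka.jar weka.core.converters.TextDirectoryLoader -dir reviews > reviews.arff`, but check the options against your Weka version.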

zdepablo answered Nov 16 '22 20:11


Took a while to work out, but with this input.arff:

@relation text_files

@attribute review string
@attribute sentiment {0, 1}

@data
"this is some text", 1
"this is some more text", 1
"different stuff", 0

And this command:

java -classpath "C:\\Program Files\\Weka-3-6\\weka.jar" weka.filters.unsupervised.attribute.StringToWordVector -i input.arff -o output.arff

The following is produced:

@relation 'text_files-weka.filters.unsupervised.attribute.StringToWordVector-R1-W1000-prune-rate-1.0-N0-stemmerweka.core.stemmers.NullStemmer-M1-tokenizerweka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?!\"'

@attribute sentiment {0,1}
@attribute different numeric
@attribute is numeric
@attribute more numeric
@attribute some numeric
@attribute stuff numeric
@attribute text numeric
@attribute this numeric

@data

{0 1,2 1,4 1,6 1,7 1}
{0 1,2 1,3 1,4 1,6 1,7 1}
{1 1,5 1}
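
The @data rows above are in sparse ARFF form: each `index value` pair names an attribute by its zero-based position in the header, and any attribute left out is zero (for a nominal attribute, its first declared value). That is why the last row has no entry for sentiment. A rough way to read the rows back, assuming integer values as in this output (the helper name `decode_sparse_row` is made up for illustration):

```python
def decode_sparse_row(row, attributes):
    """Expand a sparse ARFF row such as '{0 1,2 1}' into a full dict."""
    # Attributes missing from a sparse row default to zero. For the nominal
    # sentiment attribute this happens to be the label 0 in this example.
    values = {name: 0 for name in attributes}
    for pair in row.strip().strip("{}").split(","):
        if not pair.strip():
            continue  # tolerate an empty row like '{}'
        index, value = pair.split()
        values[attributes[int(index)]] = int(value)
    return values


# Attribute order copied from the generated header above.
attributes = ["sentiment", "different", "is", "more",
              "some", "stuff", "text", "this"]
rows = [
    "{0 1,2 1,4 1,6 1,7 1}",
    "{0 1,2 1,3 1,4 1,6 1,7 1}",
    "{1 1,5 1}",
]

decoded = [decode_sparse_row(row, attributes) for row in rows]
for values in decoded:
    words = [name for name in attributes[1:] if values[name]]
    print(values["sentiment"], words)
```

The words come back in attribute (alphabetical) order, not the order they had in the review, since the word vector only records presence. Only words that actually occur in the data get an attribute, so there is no need to declare every possible word by hand.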
Dean Barnes answered Nov 16 '22 20:11