 

How do I divide a dataset into training and test sets using Weka?

Tags: java, csv, weka

I want to divide a million-record dataset in CSV format into 80% for training and 20% for testing. How can I code this in Java, or with the Weka library?

Jeet asked Feb 11 '23 02:02



1 Answer

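You can do this in Java with the Weka library using the StratifiedRemoveFolds filter. It divides the data into N stratified folds. With 5 folds, selecting one fold gives the 20% test set, and inverting the selection gives the other four folds as the 80% training set: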

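// Imports needed by the code below
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.StratifiedRemoveFolds;

// Load data (DataSource picks the loader from the file extension)
DataSource source = new DataSource("/some/where/data.csv");
Instances data = source.getDataSet();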

// Set class to last attribute
if (data.classIndex() == -1)
    data.setClassIndex(data.numAttributes() - 1);

// Use StratifiedRemoveFolds to split the data into random, class-stratified folds
StratifiedRemoveFolds filter = new StratifiedRemoveFolds();

// set options for creating the subset of data
String[] options = new String[6];

options[0] = "-N";                 // indicate we want to set the number of folds                        
options[1] = Integer.toString(5);  // split the data into five random folds
options[2] = "-F";                 // indicate we want to select a specific fold
options[3] = Integer.toString(1);  // select the first fold
options[4] = "-S";                 // indicate we want to set the random seed
options[5] = Integer.toString(1);  // set the random seed to 1

filter.setOptions(options);        // set the filter options
filter.setInvertSelection(false);  // do not invert: keep only the selected fold
filter.setInputFormat(data);       // prepare the filter for the data format (after all options are set)

// apply filter for test data here
Instances test = Filter.useFilter(data, filter);

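// Prepare and apply the filter for the training data (the remaining four folds)
filter.setInvertSelection(true);     // invert the selection to get the other folds
filter.setInputFormat(data);         // re-initialise the filter before reusing it for a second pass
Instances train = Filter.useFilter(data, filter);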
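
If you don't need stratification, the same 80/20 split can be done in plain Java without Weka. The following is only a minimal sketch, not part of the Weka API: the class name TrainTestSplitSketch and the method split are made up for illustration, and it assumes the rows (for example, the CSV lines without the header) already fit in memory as a list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration; not a Weka class.
public class TrainTestSplitSketch {

    /**
     * Shuffles a copy of the rows with a fixed seed and cuts it at trainFraction.
     * Returns a two-element list: index 0 is the training part, index 1 the test part.
     */
    public static <T> List<List<T>> split(List<T> rows, double trainFraction, long seed) {
        if (trainFraction < 0.0 || trainFraction > 1.0) {
            throw new IllegalArgumentException("trainFraction must be in [0, 1]");
        }
        List<T> shuffled = new ArrayList<>(rows);         // leave the caller's list untouched
        Collections.shuffle(shuffled, new Random(seed));  // fixed seed makes the split repeatable
        int trainSize = (int) Math.round(shuffled.size() * trainFraction);

        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, trainSize)));
        parts.add(new ArrayList<>(shuffled.subList(trainSize, shuffled.size())));
        return parts;
    }

    public static void main(String[] args) {
        // Ten dummy rows stand in for the real CSV records.
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            rows.add(i);
        }
        List<List<Integer>> parts = split(rows, 0.8, 1L);
        System.out.println("train size = " + parts.get(0).size()); // train size = 8
        System.out.println("test size = " + parts.get(1).size());  // test size = 2
    }
}
```

Unlike StratifiedRemoveFolds, this does not preserve the class distribution in each part, so for imbalanced classes the Weka filter above is probably the safer choice. If the rows come from a CSV file, keep the header line out of the shuffle and write it back to both output files.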
Walter answered Feb 13 '23 19:02
