I want to split a million-record dataset in CSV format into 80% for training and 20% for testing. How can I do this in Java or with the Weka library?
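You can do this in Java with the Weka library using a filter called StratifiedRemoveFolds. Split the data into five stratified folds, take one fold (20%) as the test set, then invert the selection to get the remaining four folds (80%) as the training set: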
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.instance.StratifiedRemoveFolds;

// Load data
DataSource source = new DataSource("/some/where/data.csv");
Instances data = source.getDataSet();
// Set class to last attribute (stratification expects a nominal class)
if (data.classIndex() == -1)
    data.setClassIndex(data.numAttributes() - 1);
// use StratifiedRemoveFolds to randomly split the data
StratifiedRemoveFolds filter = new StratifiedRemoveFolds();
// set options for creating the subset of data
String[] options = new String[6];
options[0] = "-N"; // indicate we want to set the number of folds
options[1] = Integer.toString(5); // split the data into five random folds
options[2] = "-F"; // indicate we want to select a specific fold
options[3] = Integer.toString(1); // select the first fold
options[4] = "-S"; // indicate we want to set the random seed
options[5] = Integer.toString(1); // set the random seed to 1
filter.setOptions(options); // set the filter options
// test data: fold 1 of 5, i.e. about 20% of the instances
filter.setInvertSelection(false); // do not invert the selection
filter.setInputFormat(data); // prepare the filter; call this after all options are set
Instances test = Filter.useFilter(data, filter);
// training data: everything except fold 1, i.e. about 80%
filter.setInvertSelection(true); // invert the selection to get the other data
filter.setInputFormat(data); // re-initialize so the new setting applies to this pass
Instances train = Filter.useFilter(data, filter);
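If you would rather not depend on Weka, a plain shuffle-and-cut split can be sketched in standard Java. This is only an illustration, not part of the Weka answer above: the class name CsvSplit, the output names train.csv and test.csv, and the assumption that the first line is a header are all hypothetical. It does not stratify by class, and it assumes every record sits on a single line (no quoted fields containing line breaks).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class CsvSplit {

    // Shuffle a copy of the rows with a fixed seed, then cut at trainFraction.
    // Returns a two-element list: index 0 is the training part, index 1 the test part.
    static List<List<String>> split(List<String> rows, double trainFraction, long seed) {
        List<String> shuffled = new ArrayList<>(rows);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        List<List<String>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    // Write the header followed by the given rows to the target file.
    static void write(Path path, String header, List<String> rows) throws IOException {
        List<String> out = new ArrayList<>();
        out.add(header);
        out.addAll(rows);
        Files.write(path, out);
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the input CSV; the output file names are placeholders.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        String header = lines.get(0); // assumes the first line is a header row
        List<List<String>> parts = split(lines.subList(1, lines.size()), 0.8, 1L);
        write(Paths.get("train.csv"), header, parts.get(0));
        write(Paths.get("test.csv"), header, parts.get(1));
    }
}
```

Using a fixed seed makes the split reproducible between runs. A million short text lines should normally fit in memory, but for much larger files a streaming approach that assigns each row to a file as it is read would be a better fit.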