 

Issue understanding splitting data in Scala using "randomSplit" for Machine Learning purpose

Hi, I am new to MLlib and I am reading the documentation about it on the Spark website. I have difficulty understanding why, in the following code, we cache split "0" for training and keep split "1" for testing:

  // randomSplit divides `data` into two RDDs, roughly 60% / 40%
  val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
  val training = splits(0).cache() // first split, cached for repeated use
  val test = splits(1)             // second split

Can anyone help me understand the reason? As far as I know we need positive and negative samples, so "1" could be positive and "0" could be negative. Why is it divided like this?

Thank you!

asked Jul 21 '14 by Rubbic


1 Answer

This has nothing to do with positive and negative examples. Those should already exist (both kinds) within the data set.

You're splitting the data randomly to generate two sets: one to use while training the ML algorithm (the training set), and a second to check whether the training is working (the test set). This is standard practice and a very good idea, because it catches overfitting, which can otherwise make it seem like you have a great ML solution when it has actually just memorized the answer for each data point and can't interpolate or generalize.
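For concreteness, here is a sketch of how the two sets are typically used with MLlib's RDD-based API. The SVMWithSGD classifier and the AUC metric are illustrative choices, and `data` is assumed to be an RDD[LabeledPoint] as in the Spark documentation example the question quotes:

  import org.apache.spark.mllib.classification.SVMWithSGD
  import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

  // `data` is assumed to be an RDD[LabeledPoint], as in the question's snippet
  val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
  val training = splits(0).cache() // cached because SGD iterates over it many times
  val test = splits(1)

  // Fit the model on the training set only.
  val model = SVMWithSGD.train(training, 100)
  model.clearThreshold() // output raw scores instead of 0/1 predictions

  // Evaluate on the held-out test set, which the model has never seen.
  val scoreAndLabels = test.map { point =>
    (model.predict(point.features), point.label)
  }
  val auc = new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC()
  println(s"Test AUC = $auc")

If the training-set score is high but the test-set score is poor, that gap is the overfitting described above.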

In fact, if you have a reasonable amount of data, I would recommend splitting it into three sets: a "training" set, which you run the ML algorithm on; a "test" set, which you use to check how the training is going; and a "validation" set, which you never touch until you think your entire ML process is optimized. Optimization may require using the test set several times (e.g. to check convergence), which makes the model somewhat fit to it, so it's often hard to be sure you've really avoided overfitting. Holding out the validation set until the very end is the best way to check (or, if you can gather new data, you can do that instead).
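With randomSplit this is just one more weight in the array; the 60/20/20 proportions below are illustrative, not a rule:

  // Hypothetical 60/20/20 three-way split; randomSplit returns one RDD per weight.
  val Array(training, test, validation) =
    data.randomSplit(Array(0.6, 0.2, 0.2), seed = 11L)
  training.cache() // only the training set is read repeatedly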

Note that the split is random to avoid problems where the different data sets contain statistically different data; e.g. the early data might differ from the late data, so taking the first half and the second half of the data set could cause problems.

answered Nov 02 '22 by Rex Kerr