Split RDD for K-fold validation: pyspark

I have a dataset to which I want to apply Naive Bayes, validating with the K-fold technique. My data has two classes and is ordered, i.e. if my dataset has 100 rows, the first 50 belong to one class and the next 50 to the second class. Hence, I first want to shuffle the data and then randomly form the K folds. The problem is that when I try randomSplit on the RDD, it creates RDDs of different sizes. My code and an example dataset follow:

documentDF = sqlContext.createDataFrame([
    (0,"This is a cat".lower().split(" "), ),
    (0,"This is a dog".lower().split(" "), ),
    (0,"This is a pig".lower().split(" "), ),
    (0,"This is a mouse".lower().split(" "), ),
    (0,"This is a donkey".lower().split(" "), ),
    (0,"This is a monkey".lower().split(" "), ),
    (0,"This is a horse".lower().split(" "), ),
    (0,"This is a goat".lower().split(" "), ),
    (0,"This is a tiger".lower().split(" "), ),
    (0,"This is a lion".lower().split(" "), ),
    (1,"A mouse and a pig are friends".lower().split(" "), ),
    (1,"A pig and a dog are friends".lower().split(" "), ),
    (1,"A mouse and a cat are friends".lower().split(" "), ),
    (1,"A lion and a tiger are friends".lower().split(" "), ),
    (1,"A lion and a goat are friends".lower().split(" "), ),
    (1,"A monkey and a goat are friends".lower().split(" "), ),
    (1,"A monkey and a donkey are friends".lower().split(" "), ),
    (1,"A horse and a donkey are friends".lower().split(" "), ),
    (1,"A horse and a tiger are friends".lower().split(" "), ),
    (1,"A cat and a dog are friends".lower().split(" "), )
], ["label","text"])

from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import CountVectorizer
from pyspark.mllib.regression import LabeledPoint

def mapper_vector(x):
    row = x.text
    return LabeledPoint(x.label,row)

splitSize = [0.2]*5
print("splitSize"+str(splitSize))
print(sum(splitSize))
vect = documentDF.map(lambda x: mapper_vector(x))
splits = vect.randomSplit(splitSize, seed=0)

print("***********SPLITS**************")
for i in range(len(splits)):
    print("split"+str(i)+":"+str(len(splits[i].collect())))

This code outputs:

splitSize[0.2, 0.2, 0.2, 0.2, 0.2]
1.0
***********SPLITS**************
split0:1
split1:5
split2:3
split3:5
split4:6

The documentDF has 20 rows, and I wanted 5 distinct, mutually exclusive samples of the same size from this dataset. However, as can be seen, the splits all have different sizes. What am I doing wrong?

Edit: According to zero323, I am not doing anything wrong. Then, if I want to get the final result (as described) without using the ML CrossValidator, what do I need to change? Also, why are the numbers different? If each split has equal weight, aren't they supposed to have an equal number of rows? And is there any other way to randomize the data?

asked Apr 19 '16 by harshit

People also ask

What is the best k-fold cross-validation?

The key configuration parameter for k-fold cross-validation is k, which defines the number of folds into which to split a given dataset. Common values are k=3, k=5, and k=10; by far the most popular value used in applied machine learning to evaluate models is k=10.

How many folds should I use for cross-validation?

When performing cross-validation, it is common to use 10 folds.

What is Kfold method?

K-Fold is a validation technique in which we split the data into k subsets and repeat the holdout method k times, where each of the k subsets is used in turn as the test set and the other k-1 subsets are used for training.
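To make that rotation concrete, here is a minimal plain-Python sketch (the fold count k and the toy data list are assumptions for illustration):

# Minimal k-fold rotation: each subset is held out once as the test set.
k = 5
data = list(range(20))                    # stand-in for 20 shuffled rows

folds = [data[i::k] for i in range(k)]    # k roughly equal subsets

for i in range(k):
    test = folds[i]
    train = [row for j in range(k) if j != i for row in folds[j]]
    # fit a model on `train` and evaluate it on `test` here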


1 Answer

You're not doing anything wrong. randomSplit simply doesn't provide hard guarantees about the data distribution. It uses a BernoulliCellSampler (see How does Spark's RDD.randomSplit actually split the RDD), so the exact fractions can differ from run to run. This is normal behavior and should be perfectly acceptable on any realistically sized dataset, where the differences should be statistically insignificant.
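If exact, equal-sized folds are required regardless, one workaround (a sketch, not from the original answer; it reuses the vect RDD from the question, and rand_key is a hypothetical helper) is to shuffle the rows deterministically yourself and then bucket them by position:

import random

k = 5
seed = 0

def rand_key(idx):
    # Deterministic pseudo-random sort key per row index.
    return random.Random(seed + idx).random()

shuffled = (vect.zipWithIndex()                      # (row, original index)
                .sortBy(lambda x: rand_key(x[1]))    # deterministic shuffle
                .map(lambda x: x[0])
                .zipWithIndex())                     # (row, shuffled index)

# Bucket by shuffled index modulo k: the folds are mutually exclusive and
# exactly equal in size whenever the row count is divisible by k
# (20 rows -> 4 rows per fold).
folds = [shuffled.filter(lambda x, i=i: x[1] % k == i).map(lambda x: x[0])
         for i in range(k)]

Each folds[i] can then serve as the held-out set while the union of the remaining folds is used for training.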

On a side note, Spark ML already provides a CrossValidator which can be used with ML Pipelines (see How to cross validate RandomForest model? for example usage).
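For reference, a rough sketch of that route (the Pipeline stages and parameter grid below are illustrative assumptions, not part of the original answer):

from pyspark.ml import Pipeline
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Vectorize the tokenized text, then fit Naive Bayes; CrossValidator handles
# the k-fold splitting, training, and metric averaging internally.
cv = CountVectorizer(inputCol="text", outputCol="features")
nb = NaiveBayes(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[cv, nb])

grid = ParamGridBuilder().addGrid(nb.smoothing, [0.5, 1.0]).build()
evaluator = MulticlassClassificationEvaluator(labelCol="label")

validator = CrossValidator(estimator=pipeline,
                           estimatorParamMaps=grid,
                           evaluator=evaluator,
                           numFolds=5)
model = validator.fit(documentDF)  # works on the DataFrame directly

Note that this operates on the DataFrame from the question directly, so no conversion to an RDD of LabeledPoints is needed.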

answered Oct 22 '22 by zero323