Split RDD for K-fold validation: pyspark

I have a dataset to which I want to apply Naive Bayes, validating with the K-fold technique. My data has two classes and is ordered, i.e. if my dataset has 100 rows, the first 50 belong to one class and the next 50 to the second class. Hence, I first want to shuffle the data and then randomly form the K folds. The problem is that when I try randomSplit on the RDD, it creates RDDs of different sizes. My code and an example dataset follow:

documentDF = sqlContext.createDataFrame([
    (0,"This is a cat".lower().split(" "), ),
    (0,"This is a dog".lower().split(" "), ),
    (0,"This is a pig".lower().split(" "), ),
    (0,"This is a mouse".lower().split(" "), ),
    (0,"This is a donkey".lower().split(" "), ),
    (0,"This is a monkey".lower().split(" "), ),
    (0,"This is a horse".lower().split(" "), ),
    (0,"This is a goat".lower().split(" "), ),
    (0,"This is a tiger".lower().split(" "), ),
    (0,"This is a lion".lower().split(" "), ),
    (1,"A mouse and a pig are friends".lower().split(" "), ),
    (1,"A pig and a dog are friends".lower().split(" "), ),
    (1,"A mouse and a cat are friends".lower().split(" "), ),
    (1,"A lion and a tiger are friends".lower().split(" "), ),
    (1,"A lion and a goat are friends".lower().split(" "), ),
    (1,"A monkey and a goat are friends".lower().split(" "), ),
    (1,"A monkey and a donkey are friends".lower().split(" "), ),
    (1,"A horse and a donkey are friends".lower().split(" "), ),
    (1,"A horse and a tiger are friends".lower().split(" "), ),
    (1,"A cat and a dog are friends".lower().split(" "), )
], ["label","text"])

from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import CountVectorizer
from pyspark.mllib.regression import LabeledPoint

def mapper_vector(x):
    row = x.text
    return LabeledPoint(x.label,row)

splitSize = [0.2]*5
print("splitSize"+str(splitSize))
print(sum(splitSize))
vect = documentDF.map(lambda x: mapper_vector(x))
splits = vect.randomSplit(splitSize, seed=0)

print("***********SPLITS**************")
for i in range(len(splits)):
    print("split"+str(i)+":"+str(len(splits[i].collect())))

This code outputs:

splitSize[0.2, 0.2, 0.2, 0.2, 0.2]
1.0
***********SPLITS**************
split0:1
split1:5
split2:3
split3:5
split4:6

The documentDF has 20 rows, and I wanted 5 distinct, mutually exclusive samples of the same size from this dataset. However, as can be seen, the splits all have different sizes. What am I doing wrong?

Edit: According to zero323, I am not doing anything wrong. Then, if I want to get the final result (as described) without using the ML CrossValidator, what do I need to change? Also, why are the numbers different? If each split has equal weight, aren't they supposed to have an equal number of rows? And is there any other way to randomize the data?

asked Apr 19 '16 by harshit

People also ask

What is the best k-fold cross-validation?

The key configuration parameter for k-fold cross-validation is k, which defines the number of folds into which to split a given dataset. Common values are k=3, k=5, and k=10; by far the most popular value used in applied machine learning to evaluate models is k=10.

How many folds should I use for cross-validation?

When performing cross-validation, it is common to use 10 folds.

What is Kfold method?

K-Fold is a validation technique in which we split the data into k subsets and repeat the holdout method k times, where each of the k subsets is used in turn as the test set and the other k-1 subsets are used for training.
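To make that rotation concrete, here is a minimal plain-Python sketch (the fold count k and the toy data list are assumptions for illustration):

# Minimal k-fold rotation: each subset is held out once as the test set.
k = 5
data = list(range(20))                    # stand-in for 20 shuffled rows

folds = [data[i::k] for i in range(k)]    # k roughly equal subsets

for i in range(k):
    test = folds[i]
    train = [row for j in range(k) if j != i for row in folds[j]]
    # fit a model on `train` and evaluate it on `test` here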


1 Answer

You're not doing anything wrong. randomSplit simply doesn't provide hard guarantees about the data distribution. It uses a BernoulliCellSampler (see How does Spark's RDD.randomSplit actually split the RDD), so the exact fractions can differ from run to run. This is normal behavior and should be perfectly acceptable on any realistically sized dataset, where the differences should be statistically insignificant.
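If exact, equal-sized folds are required regardless, one workaround (a sketch, not from the original answer; it reuses the vect RDD from the question, and rand_key is a hypothetical helper) is to shuffle the rows deterministically yourself and then bucket them by position:

import random

k = 5
seed = 0

def rand_key(idx):
    # Deterministic pseudo-random sort key per row index.
    return random.Random(seed + idx).random()

shuffled = (vect.zipWithIndex()                      # (row, original index)
                .sortBy(lambda x: rand_key(x[1]))    # deterministic shuffle
                .map(lambda x: x[0])
                .zipWithIndex())                     # (row, shuffled index)

# Bucket by shuffled index modulo k: the folds are mutually exclusive and
# exactly equal in size whenever the row count is divisible by k
# (20 rows -> 4 rows per fold).
folds = [shuffled.filter(lambda x, i=i: x[1] % k == i).map(lambda x: x[0])
         for i in range(k)]

Each folds[i] can then serve as the held-out set while the union of the remaining folds is used for training.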

On a side note, Spark ML already provides a CrossValidator which can be used with ML Pipelines (see How to cross validate RandomForest model? for example usage).
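For reference, a rough sketch of that route (the Pipeline stages and parameter grid below are illustrative assumptions, not part of the original answer):

from pyspark.ml import Pipeline
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Vectorize the tokenized text, then fit Naive Bayes; CrossValidator handles
# the k-fold splitting, training, and metric averaging internally.
cv = CountVectorizer(inputCol="text", outputCol="features")
nb = NaiveBayes(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[cv, nb])

grid = ParamGridBuilder().addGrid(nb.smoothing, [0.5, 1.0]).build()
evaluator = MulticlassClassificationEvaluator(labelCol="label")

validator = CrossValidator(estimator=pipeline,
                           estimatorParamMaps=grid,
                           evaluator=evaluator,
                           numFolds=5)
model = validator.fit(documentDF)  # works on the DataFrame directly

Note that this operates on the DataFrame from the question directly, so no conversion to an RDD of LabeledPoints is needed.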

answered Oct 22 '22 by zero323