
How to use k-fold cross-validation in scikit-learn with a naive Bayes classifier and NLTK

I have a small corpus and I want to calculate the accuracy of a naive Bayes classifier using 10-fold cross-validation. How can I do it?

user2284345 asked May 04 '13 21:05


2 Answers

Your options are to either set this up yourself or use something like NLTK-Trainer, since NLTK doesn't directly support cross-validation for machine learning algorithms.

I'd recommend using another module to do this for you, but if you really want to write your own code, you could do something like the following.

For 10-fold cross-validation, partition your training set into 10 subsets, train on 9 of them, test on the remaining one, and repeat so that each of the 10 subsets serves as the test set exactly once.

Assuming your training set is in a list named training, a simple way to accomplish this is:

    num_folds = 10
    subset_size = len(training) // num_folds  # integer division, so the slice indices are ints
    for i in range(num_folds):
        testing_this_round = training[i*subset_size:][:subset_size]
        training_this_round = training[:i*subset_size] + training[(i+1)*subset_size:]
        # train using training_this_round
        # evaluate against testing_this_round
        # save accuracy

    # find mean accuracy over all rounds
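As a rough illustration, the loop can be wrapped in a small helper that takes the training and scoring steps as arguments. This is only a sketch: the names k_fold_accuracy, train_majority and majority_accuracy are hypothetical, and the toy "classifier" simply predicts the most common training label. With NLTK, nltk.NaiveBayesClassifier.train and nltk.classify.accuracy should fit the same two slots, since they take a list of (featureset, label) pairs and a (classifier, labeled list) pair respectively. Unlike plain integer division, the fold boundaries here spread the remainder across folds, so every item is tested once.

```python
from collections import Counter
from statistics import mean


def k_fold_accuracy(data, train_fn, accuracy_fn, num_folds=10):
    """Return one accuracy per fold, using contiguous, near-equal folds."""
    # Boundaries such as [0, 2, 4, 6, 9, ...] cover the whole list,
    # so leftover items are not silently dropped from testing.
    bounds = [len(data) * i // num_folds for i in range(num_folds + 1)]
    scores = []
    for start, stop in zip(bounds, bounds[1:]):
        test_fold = data[start:stop]
        train_folds = data[:start] + data[stop:]
        model = train_fn(train_folds)
        scores.append(accuracy_fn(model, test_fold))
    return scores


# Hypothetical stand-ins for a real classifier: always predict the
# most common label seen during training.
def train_majority(labeled):
    return Counter(label for _, label in labeled).most_common(1)[0][0]


def majority_accuracy(model, labeled):
    return sum(label == model for _, label in labeled) / len(labeled)


# Toy (featureset, label) pairs in the shape NLTK classifiers expect.
data = [({"n": i}, "pos" if i % 4 else "neg") for i in range(23)]
scores = k_fold_accuracy(data, train_majority, majority_accuracy)
print(len(scores), mean(scores))
```

If the list is ordered by class, shuffle it first (for example with random.shuffle); otherwise some folds may contain only one label.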
Jared answered Sep 21 '22 14:09


There is actually no need for the explicit loop shown in the most upvoted answer. The choice of classifier is also irrelevant (it can be any classifier).

scikit-learn provides cross_val_score, which does all the looping under the hood.

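    # sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.naive_bayes import MultinomialNB

    # X is the feature matrix and y the labels, prepared beforehand
    k_fold = KFold(n_splits=10, shuffle=True, random_state=0)
    clf = MultinomialNB()  # naive Bayes here, but any scikit-learn classifier works
    print(cross_val_score(clf, X, y, cv=k_fold, n_jobs=1))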
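For a corpus stored as NLTK-style feature dictionaries, a minimal end-to-end sketch might look like the following. It assumes a current scikit-learn release, where KFold and cross_val_score live in sklearn.model_selection. The toy corpus, the word-count features, and the choice of DictVectorizer with MultinomialNB are illustrative assumptions, not part of the original answer.

```python
from collections import Counter

from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for a real one: 10 documents per class.
pos_words = ["good", "great", "fun"]
neg_words = ["bad", "boring", "dull"]
corpus = [(f"{pos_words[i % 3]} movie", "pos") for i in range(10)]
corpus += [(f"{neg_words[i % 3]} movie", "neg") for i in range(10)]

# NLTK-style feature sets: a dict of word counts per document.
X_dicts = [Counter(text.split()) for text, _ in corpus]
y = [label for _, label in corpus]

# The pipeline refits the vectorizer inside every fold, so no
# information from the held-out fold leaks into training.
model = make_pipeline(DictVectorizer(), MultinomialNB())
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)

scores = cross_val_score(model, X_dicts, y, cv=k_fold)  # accuracy by default
print(scores.mean())
```

The mean of the ten scores is the cross-validated accuracy estimate; with a small corpus, the per-fold scores can vary widely, so it is worth looking at their spread as well.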
Salvador Dali answered Sep 23 '22 14:09