
cross validation + decision trees in sklearn

I am attempting to create a decision tree with cross-validation using sklearn and pandas.

My question is about the code below: the cross-validation module splits the data, which I then use for both training and testing. I will be attempting to find the best depth of the tree by recreating it n times with different max depths set. In using cross-validation, should I instead be using k-fold CV, and if so, how would I use that within the code I have?

import numpy as np
import pandas as pd
from sklearn import tree
from sklearn import cross_validation

features = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym",
            "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

df = pd.read_csv('magic04.data', header=None, names=features)
df['class'] = df['class'].map({'g': 0, 'h': 1})

x = df[features[:-1]]
y = df['class']

x_train, x_test, y_train, y_test = cross_validation.train_test_split(x, y, test_size=0.4, random_state=0)

depth = []
for i in range(3, 20):
    clf = tree.DecisionTreeClassifier(max_depth=i)
    clf = clf.fit(x_train, y_train)
    depth.append((i, clf.score(x_test, y_test)))
print depth

Here is a link to the data I am using, in case it helps anyone: https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope

asked Jan 30 '16 by razeal113


1 Answer

In your code you are creating a single static train/test split. If you want to select the best depth by cross-validation, you can use cross_val_score inside the for loop. (Note that the sklearn.cross_validation module was deprecated in favor of sklearn.model_selection in scikit-learn 0.18, so the code below imports from the newer module.)

You can read sklearn's documentation for more information.

Here is an update of your code with CV:

import pandas as pd
from sklearn import tree
from sklearn.model_selection import cross_val_score

features = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym",
            "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

df = pd.read_csv('magic04.data', header=None, names=features)
df['class'] = df['class'].map({'g': 0, 'h': 1})

x = df[features[:-1]]
y = df['class']

# The static train/test split is no longer needed:
# x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)

depth = []
for i in range(3, 20):
    clf = tree.DecisionTreeClassifier(max_depth=i)
    # Perform 7-fold cross-validation and keep the mean accuracy
    scores = cross_val_score(estimator=clf, X=x, y=y, cv=7, n_jobs=4)
    depth.append((i, scores.mean()))
print(depth)
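To then pick the single best depth from the loop, you can take the tuple with the highest mean score. A minimal sketch over the depth list built above (variable names are from that snippet):

best_depth, best_score = max(depth, key=lambda t: t[1])  # tuple with highest mean CV accuracy
print("best max_depth:", best_depth, "mean CV accuracy:", best_score)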

Alternatively, you can use sklearn.model_selection.GridSearchCV instead of writing the for loop yourself, especially if you want to optimize more than one hyperparameter.

import pandas as pd
from sklearn import tree
from sklearn.model_selection import GridSearchCV

features = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym",
            "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

df = pd.read_csv('magic04.data', header=None, names=features)
df['class'] = df['class'].map({'g': 0, 'h': 1})

x = df[features[:-1]]
y = df['class']

# Grid-search max_depth values 3..19 with the default cross-validation splitter
parameters = {'max_depth': range(3, 20)}
clf = GridSearchCV(tree.DecisionTreeClassifier(), parameters, n_jobs=4)
clf.fit(X=x, y=y)
tree_model = clf.best_estimator_
print(clf.best_score_, clf.best_params_)
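If you also want an unbiased estimate of the tuned tree, a common pattern (not part of the original answer, just a sketch reusing x and y as defined above) is to keep a held-out test set that the grid search never sees:

from sklearn.model_selection import train_test_split

# Hold out 40% of the data; the grid search tunes only on the training part
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)
clf.fit(X=x_train, y=y_train)
print(clf.best_params_, clf.best_estimator_.score(x_test, y_test))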

Edit: changed how GridSearchCV is imported to accommodate learn2day's comment.

answered Oct 26 '22 by Dimosthenis