Attempting to create a decision tree with cross-validation using sklearn and pandas.
My question is about the code below: the split gives me data that I then use for both training and testing. I will be attempting to find the best depth of the tree by recreating it n times with different max depths set. When using cross-validation, should I instead be using k-fold CV, and if so, how would I use that within the code I have?
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

features = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym",
            "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]
df = pd.read_csv('magic04.data', header=None, names=features)
df['class'] = df['class'].map({'g': 0, 'h': 1})  # encode labels: gamma -> 0, hadron -> 1
x = df[features[:-1]]
y = df['class']

# Static 60/40 train-test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)

depth = []
for i in range(3, 20):
    clf = tree.DecisionTreeClassifier(max_depth=i)
    clf = clf.fit(x_train, y_train)
    depth.append((i, clf.score(x_test, y_test)))
print(depth)
Here is a link to the data I am using, in case that helps anyone: https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope
In your code you are creating a single static train-test split. If you want to select the best depth by cross-validation, you can use sklearn.model_selection.cross_val_score inside the for loop. See sklearn's documentation for more information.
Here is an updated version of your code with CV:

import pandas as pd
from sklearn import tree
from sklearn.model_selection import cross_val_score
from pprint import pprint

features = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym",
            "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]
df = pd.read_csv('magic04.data', header=None, names=features)
df['class'] = df['class'].map({'g': 0, 'h': 1})
x = df[features[:-1]]
y = df['class']

# The static train-test split is no longer needed:
# x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)

depth = []
for i in range(3, 20):
    clf = tree.DecisionTreeClassifier(max_depth=i)
    # Perform 7-fold cross-validation and record the mean accuracy
    scores = cross_val_score(estimator=clf, X=x, y=y, cv=7, n_jobs=4)
    depth.append((i, scores.mean()))
pprint(depth)
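If you then want to read off the winning depth from that list, take the entry with the highest mean score. A minimal sketch, assuming the depth list built by the loop above:

# Pick the (depth, mean CV accuracy) pair with the highest score
best_depth, best_score = max(depth, key=lambda t: t[1])
print(best_depth, best_score)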
Alternatively, you can use sklearn.model_selection.GridSearchCV and avoid writing the for loop yourself, especially if you want to optimize over more than one hyperparameter.
import pandas as pd
from sklearn import tree
from sklearn.model_selection import GridSearchCV

features = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym",
            "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]
df = pd.read_csv('magic04.data', header=None, names=features)
df['class'] = df['class'].map({'g': 0, 'h': 1})
x = df[features[:-1]]
y = df['class']

# Cross-validated grid search over max_depth values 3..19
parameters = {'max_depth': range(3, 20)}
clf = GridSearchCV(tree.DecisionTreeClassifier(), parameters, n_jobs=4)
clf.fit(X=x, y=y)
tree_model = clf.best_estimator_
print(clf.best_score_, clf.best_params_)
Edit: changed how GridSearchCV is imported to accommodate learn2day's comment.
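For classifiers, GridSearchCV uses stratified k-fold splitting by default; if you want explicit control over the number of folds and the shuffling, you can pass a splitter object via the cv argument instead. A minimal sketch, assuming the same x, y, and parameters as in the block above:

from sklearn import tree
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# 7 stratified folds, shuffled with a fixed seed for reproducibility
skf = StratifiedKFold(n_splits=7, shuffle=True, random_state=0)
clf = GridSearchCV(tree.DecisionTreeClassifier(), parameters, cv=skf, n_jobs=4)
clf.fit(X=x, y=y)
print(clf.best_score_, clf.best_params_)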