Attempting to create a decision tree with cross-validation using sklearn and pandas.
My question is about the code below: the split gives me data that I then use for both training and testing. I will be attempting to find the best depth of the tree by recreating it n times with different max depths set. When using cross-validation, should I instead be using k-fold CV, and if so, how would I use that within the code I have?
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

features = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym",
            "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]
df = pd.read_csv('magic04.data', header=None, names=features)
df['class'] = df['class'].map({'g': 0, 'h': 1})  # encode labels: gamma -> 0, hadron -> 1
x = df[features[:-1]]
y = df['class']

# Static 60/40 train-test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)

depth = []
for i in range(3, 20):
    clf = tree.DecisionTreeClassifier(max_depth=i)
    clf = clf.fit(x_train, y_train)
    depth.append((i, clf.score(x_test, y_test)))
print(depth)
Here is a link to the data I am using, in case that helps anyone: https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope
In your code you are creating a single static train-test split. If you want to select the best depth by cross-validation, you can use sklearn.model_selection.cross_val_score inside the for loop. See sklearn's documentation for more information.
Here is an updated version of your code with CV:

import pandas as pd
from sklearn import tree
from sklearn.model_selection import cross_val_score
from pprint import pprint

features = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym",
            "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]
df = pd.read_csv('magic04.data', header=None, names=features)
df['class'] = df['class'].map({'g': 0, 'h': 1})
x = df[features[:-1]]
y = df['class']

# The static train-test split is no longer needed:
# x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)

depth = []
for i in range(3, 20):
    clf = tree.DecisionTreeClassifier(max_depth=i)
    # Perform 7-fold cross-validation and record the mean accuracy
    scores = cross_val_score(estimator=clf, X=x, y=y, cv=7, n_jobs=4)
    depth.append((i, scores.mean()))
pprint(depth)
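If you then want to read off the winning depth from that list, take the entry with the highest mean score. A minimal sketch, assuming the depth list built by the loop above:

# Pick the (depth, mean CV accuracy) pair with the highest score
best_depth, best_score = max(depth, key=lambda t: t[1])
print(best_depth, best_score)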
Alternatively, you can use sklearn.model_selection.GridSearchCV and avoid writing the for loop yourself, especially if you want to optimize over more than one hyperparameter.
import pandas as pd
from sklearn import tree
from sklearn.model_selection import GridSearchCV

features = ["fLength", "fWidth", "fSize", "fConc", "fConc1", "fAsym",
            "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]
df = pd.read_csv('magic04.data', header=None, names=features)
df['class'] = df['class'].map({'g': 0, 'h': 1})
x = df[features[:-1]]
y = df['class']

# Cross-validated grid search over max_depth values 3..19
parameters = {'max_depth': range(3, 20)}
clf = GridSearchCV(tree.DecisionTreeClassifier(), parameters, n_jobs=4)
clf.fit(X=x, y=y)
tree_model = clf.best_estimator_
print(clf.best_score_, clf.best_params_)
Edit: changed how GridSearchCV is imported to accommodate learn2day's comment.
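For classifiers, GridSearchCV uses stratified k-fold splitting by default; if you want explicit control over the number of folds and the shuffling, you can pass a splitter object via the cv argument instead. A minimal sketch, assuming the same x, y, and parameters as in the block above:

from sklearn import tree
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# 7 stratified folds, shuffled with a fixed seed for reproducibility
skf = StratifiedKFold(n_splits=7, shuffle=True, random_state=0)
clf = GridSearchCV(tree.DecisionTreeClassifier(), parameters, cv=skf, n_jobs=4)
clf.fit(X=x, y=y)
print(clf.best_score_, clf.best_params_)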