I am working with the load_iris data set from sklearn in Python, and with the equivalent built-in iris data set in R.
I built the model in both languages using the "gini" criterion, and in both languages the model tests correctly when the test data is taken directly from the iris data set.
However, when I give it a new observation as test input, Python and R put the same data into different categories.
I'm not sure what I am missing or doing wrong, so any guidance would be very much appreciated.
The code is below. Python 2.7:
from sklearn.datasets import load_iris
from sklearn import tree

iris = load_iris()

# Grow a decision tree on the full iris data set using the Gini criterion
model = tree.DecisionTreeClassifier(criterion='gini')
model.fit(iris.data, iris.target)
model.score(iris.data, iris.target)

# Predict a few rows taken from the training data, then a new observation
print iris.data[49], model.predict([iris.data[49]])
print iris.data[99], model.predict([iris.data[99]])
print iris.data[100], model.predict([iris.data[100]])
print iris.data[149], model.predict([iris.data[149]])
print [6.3, 2.8, 6, 1.3], model.predict([[6.3, 2.8, 6, 1.3]])
R (RStudio, R 3.3.2, 32-bit):
library(rpart)

iris <- iris
x_train = iris[c('Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width')]
y_train = as.matrix(cbind(iris['Species']))
x <- cbind(x_train, y_train)

fit <- rpart(y_train ~ ., data = x_train, method = "class",
             parms = list(split = "gini"))
summary(fit)

# Build the new observation by overwriting the last row of the data
x_test = x[149, ]
x_test[, 1] = 6.3
x_test[, 2] = 2.8
x_test[, 3] = 6
x_test[, 4] = 1.3

predicted1 = predict(fit, x[49, ])   # same as Python's result
predicted2 = predict(fit, x[100, ])  # same as Python's result
predicted3 = predict(fit, x[101, ])  # same as Python's result
predicted4 = predict(fit, x[149, ])  # same as Python's result
predicted5 = predict(fit, x_test)    # this value does not match Python's result
My Python output is:
[ 5. 3.3 1.4 0.2] [0]
[ 5.7 2.8 4.1 1.3] [1]
[ 6.3 3.3 6. 2.5] [2]
[ 5.9 3. 5.1 1.8] [2]
[6.3, 2.8, 6, 1.3] [2] -----> this means it's putting the test data into the virginica bucket
and the R output is:
> predicted1
setosa versicolor virginica
49 1 0 0
> predicted2
setosa versicolor virginica
100 0 0.9074074 0.09259259
> predicted3
setosa versicolor virginica
101 0 0.02173913 0.9782609
> predicted4
setosa versicolor virginica
149 0 0.02173913 0.9782609
> predicted5
setosa versicolor virginica
149 0 0.9074074 0.09259259 --> this means it's putting the test data into the versicolor bucket
Please help. Thank you.
Decision trees involve quite a few parameters (minimum/maximum leaf size, depth of the tree, when to split, etc.), and different packages may have different default settings. If you want to get the same results, you need to make sure the implicit defaults are similar. For instance, try running the following:
fit <- rpart(y_train ~ ., data = x_train, method = "class",
             parms = list(split = "gini"),
             control = rpart.control(minsplit = 2, minbucket = 1,
                                     xval = 0, maxdepth = 30))
(predicted5= predict(fit,x_test))
setosa versicolor virginica
149 0 0.3333333 0.6666667
Here, the options minsplit = 2, minbucket = 1, xval = 0 and maxdepth = 30 are chosen to match the sklearn defaults (min_samples_split=2, min_samples_leaf=1, no depth limit). maxdepth = 30 is the largest value rpart will let you have; sklearn has no bound here. If you want the probabilities etc. to be identical as well, you will probably want to play around with the cp parameter too.
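One way to double-check the sklearn side of this comparison is to inspect the estimator's defaults directly with `get_params()` (a small sketch of my own, not code from the question; it only prints the three settings that rpart.control is mimicking above):

```python
# Inspect sklearn's DecisionTreeClassifier defaults so they can be
# compared against rpart.control's minsplit / minbucket / maxdepth.
from sklearn.tree import DecisionTreeClassifier

params = DecisionTreeClassifier(criterion='gini').get_params()

# By default sklearn grows the tree until the leaves are pure:
print(params['min_samples_split'])  # 2  (rpart default minsplit is 20)
print(params['min_samples_leaf'])   # 1  (rpart default minbucket is minsplit/3)
print(params['max_depth'])          # None, i.e. unbounded (rpart caps at 30)
```

Seeing those values side by side makes it clear why the two fits disagree out of the box: sklearn's defaults grow a much deeper tree than rpart's.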
Similarly, with
model = tree.DecisionTreeClassifier(criterion='gini',
                                    min_samples_split=20,
                                    min_samples_leaf=round(20.0/3.0),
                                    max_depth=30)
model.fit(iris.data, iris.target)
I get
print model.predict([iris.data[49]])
print model.predict([iris.data[99]])
print model.predict([iris.data[100]])
print model.predict([iris.data[149]])
print model.predict([[6.3,2.8,6,1.3]])
[0]
[1]
[2]
[2]
[1]
which matches your initial R output pretty closely.
Needless to say, be careful when your predictions on the training set seem "unreasonably good", as you are likely overfitting the data. For instance, have a look at model.predict_proba(...), which gives you the class probabilities in sklearn (instead of the predicted classes). You should see that with your current Python code / settings, you are almost surely overfitting.
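To make that concrete, here is a minimal sketch (assuming scikit-learn is installed) showing that the default, fully grown tree reproduces the training set perfectly and returns hard 0/1 "probabilities", both classic symptoms of overfitting:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Default settings: the tree keeps splitting until every leaf is pure
default_tree = DecisionTreeClassifier(criterion='gini')
default_tree.fit(iris.data, iris.target)

# Perfect training accuracy -- suspiciously good
print(default_tree.score(iris.data, iris.target))  # 1.0

# Every leaf is pure, so predict_proba returns hard 0/1 values
# rather than graded probabilities like rpart's 0.91 / 0.09 split
print(default_tree.predict_proba([[6.3, 2.8, 6.0, 1.3]]))
```

Contrast this with the constrained model above (min_samples_split=20, etc.), whose leaves mix classes and therefore yield calibrated-looking probabilities closer to rpart's output.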