I am writing a very simple script. All I have to do is read the data using pandas and then train a decision tree on it. The data I am using is:
https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data
And following is my script
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree
from sklearn import preprocessing
import pandas as pd
balance_data=pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data",
sep= ',', header= None)
#print "Dataset:: "
#df1.head()
X = balance_data.values[:, 0:5]
Y = balance_data.values[:,6]
X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size = 0.2, random_state = 100)
clf_gini = DecisionTreeClassifier(criterion = "gini", random_state = 100,
max_depth=3, min_samples_leaf=5)
clf_gini.fit(X_train, y_train)
From the error I am guessing that it couldn't convert the "med" attribute value to float. Looking at the data, my guess is that "low" has a space before it and "med" doesn't, which is why it is getting confused, but I am not sure about that. Please tell me what could be wrong. PS: the error occurs at the last line, and here is the traceback:
ValueError Traceback (most recent call last)
<ipython-input-26-b495e5f26174> in <module>()
18 max_depth=3, min_samples_leaf=5)
19 X_train[X_train != '']
---> 20 clf_gini.fit(X_train, y_train)
/home/fatima/anaconda2/lib/python2.7/site-packages/sklearn/tree/tree.pyc in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
788 sample_weight=sample_weight,
789 check_input=check_input,
--> 790 X_idx_sorted=X_idx_sorted)
791 return self
792
/home/fatima/anaconda2/lib/python2.7/site-packages/sklearn/tree/tree.pyc in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
114 random_state = check_random_state(self.random_state)
115 if check_input:
--> 116 X = check_array(X, dtype=DTYPE, accept_sparse="csc")
117 y = check_array(y, ensure_2d=False, dtype=None)
118 if issparse(X):
/home/fatima/anaconda2/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
400 force_all_finite)
401 else:
--> 402 array = np.array(array, dtype=dtype, order=order, copy=copy)
403
404 if ensure_2d:
ValueError: could not convert string to float: med
The dataset looks like this:
0 1 2 3 4 5 6
0 vhigh vhigh 2 2 small low unacc
1 vhigh vhigh 2 2 small med unacc
2 vhigh vhigh 2 2 small high unacc
3 vhigh vhigh 2 2 med low unacc
4 vhigh vhigh 2 2 med med unacc
All of the data types (dtypes) are object. However, machine learning algorithms can only learn from numbers (int, float, double), so you need to encode the data before using it for training.
There are several ways to encode your data; one of them is label encoding. To do that, add the following lines to your code just after loading the dataset:
le = preprocessing.LabelEncoder()
balance_data = balance_data.apply(le.fit_transform)
Now the data in balance_data looks like this:
0 1 2 3 4 5 6
0 3 3 0 0 2 1 2
1 3 3 0 0 2 2 2
2 3 3 0 0 2 0 2
3 3 3 0 0 1 1 2
4 3 3 0 0 1 2 2
where all the data types are now int.
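One caveat with the apply one-liner above: it re-fits the same LabelEncoder on every column, so the per-column mappings are not kept anywhere. If you also need to map the encoded integers back to the original strings later, you can fit and store one encoder per column instead; a minimal sketch (run on the freshly loaded string data, in place of the apply line; the encoders dictionary is a name I introduce here):
# Fit one LabelEncoder per column and keep it so the integer codes
# can be decoded back to the original category names later.
encoders = {}
for col in balance_data.columns:
    enc = preprocessing.LabelEncoder()
    balance_data[col] = enc.fit_transform(balance_data[col])
    encoders[col] = enc

# e.g. recover the original class names of the target column (6):
# encoders[6].inverse_transform(balance_data[6].values)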
In general, you need to perform some data preprocessing before training/fitting your model, so I recommend going through a tutorial on data preprocessing to understand the process.
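Another way to encode the data, since all of these features are nominal categories, is one-hot encoding. If you take this route instead of label encoding, apply it to the raw string columns right after read_csv; a minimal sketch (X_encoded and y are names I introduce here):
# One-hot encode the six string feature columns of the raw data;
# the target can stay as strings, since DecisionTreeClassifier
# accepts string class labels directly.
X_encoded = pd.get_dummies(balance_data.iloc[:, 0:6])
y = balance_data.iloc[:, 6]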
Here's the overall code with the fix:
import numpy as np
from sklearn.cross_validation import train_test_split  # deprecated; in newer scikit-learn use sklearn.model_selection
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree
from sklearn import preprocessing
import pandas as pd
balance_data=pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data",
sep= ',', header= None)
#print "Dataset:: "
#df1.head()
# Encode each column's string categories as integers
le = preprocessing.LabelEncoder()
balance_data = balance_data.apply(le.fit_transform)
X = balance_data.values[:, 0:6]  # columns 0-5 are the six feature columns (0:5 would drop column 5)
Y = balance_data.values[:, 6]    # column 6 is the class label
X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size = 0.2, random_state = 100)
clf_gini = DecisionTreeClassifier(criterion = "gini", random_state = 100,
max_depth=3, min_samples_leaf=5)
clf_gini.fit(X_train, y_train)
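accuracy_score is imported but never used; a quick check of the fitted tree on the held-out 20% split would look like this (a short sketch):
# Evaluate the fitted tree on the test split
y_pred = clf_gini.predict(X_test)
print(accuracy_score(y_test, y_pred))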