Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Decision tree implementation for returning the next feature to split the tree

Suppose my data consist of fruits, described by their color and shape and more features. I would like to return maximum of X fruits that have the features the user stated and I would like to do it in the minimum num of questions.

The first questions I always ask the user are what is the color and shape of the fruit. According to the user answer I would like to ask for K more features like texture size peel type etc.. I would like K to be the smallest num that will return the most accurate X results therefore I would like to know what is the next feature I should ask the user for. My DB consist of fruits classified to features (arbitrary values).

Is it a machine learning problem? What is the algoritm I should use and which implementation I should use. I've tried to look in scikit-learn, nltk, weka for suitable algorithm to answer this problem. Either those algorithm are not suitable for answering this problem or I need more specific guiding using them.

Thank you!

like image 335
Liatz Avatar asked Dec 14 '25 15:12

Liatz


1 Answers

Yes it is.

A decision tree projects the points on to each feature and finds the best split. This split can be determined by different metrics, for example: gini index or entropy (information gain) Sci-kit learn has this in sklearn.tree

Suppose you have 5 data points:

 color   shape   fruit
 orange  oblong  orange
 red     round   apple
 orange  round   orange
 red     oblong  apple
 red     round   apple

So to train you would do something like this:

feature   class  |  feature  class
orange    orange |  oblong   orange
red       apple  |  round    apple
orange    orange |  round    orange
red       apple  |  oblong   apple
red       apple  |  round    apple

As you can see the best split is color because for this dataset if color=red then fruit = apple, and if color = orange then fruit = orange.

Training on these data points you would have the decision tree:

        color
___________________
|                 |
|                 |
red               orange
apple             orange

In real life these splits would be based on numerical values i.e num > .52.

As for what algorithms to use for this, it depends. You'll have to do the research for your own data because it's more of a per dataset/preference kind of thing.

You could use sci-kit learn on the example above like this:

from sklearn.trees import DecisionTreeClassifier
#make your sample matrix 
samples = [[1,1], [0,0], [1,0], [0,1], [0,0]]
#make your target vector ( in this case fruit)
fruitname = [1, 0, 1, 0, 0]
#create and fit the model
dtree =  DecisionTreeClassifier()
dtree =  dtree.fit(samples, fruitname)
#test an unknown red fruit that is oblong
dtree.predict([0,1])

Note that color=1 means that the fruit is orange, and shape=1 means that the fruit is oblong.

Have a look through the sci-kit user guide for a more in-depth overview.

like image 179
Raufio Avatar answered Dec 19 '25 02:12

Raufio



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!