Suppose my data consist of fruits, described by their color, shape, and other features. I would like to return at most X fruits that have the features the user stated, and I would like to do it in the minimum number of questions.
The first questions I always ask the user are the color and shape of the fruit. Based on those answers I would like to ask for K more features, like texture, size, peel type, etc. I would like K to be the smallest number that returns the most accurate X results, so I would like to know which feature I should ask the user about next. My DB consists of fruits classified by features (with arbitrary values).
Is this a machine learning problem? Which algorithm should I use, and which implementation? I've looked through scikit-learn, nltk, and weka for a suitable algorithm. Either those algorithms are not suitable for this problem or I need more specific guidance on using them.
Thank you!
Yes, it is.
A decision tree projects the points onto each feature and finds the best split. The split can be chosen by different metrics, for example the Gini index or entropy (information gain). Scikit-learn has this in sklearn.tree.
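To make those metrics concrete, here is a minimal sketch of Gini impurity and entropy computed from a list of class labels; the gini and entropy helpers are my own, not part of scikit-learn:
from collections import Counter
from math import log2

def gini(labels):
    #Gini impurity: 1 minus the sum of squared class proportions
    counts = Counter(labels)
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def entropy(labels):
    #Shannon entropy: -sum(p * log2(p)) over class proportions
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())

print(gini(["apple", "apple", "orange"]))     # ~0.444
print(entropy(["apple", "apple", "orange"]))  # ~0.918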
Suppose you have 5 data points:
color   shape   fruit
orange  oblong  orange
red     round   apple
orange  round   orange
red     oblong  apple
red     round   apple
So to train, you would compare each feature against the class like this:
color   fruit   |  shape   fruit
orange  orange  |  oblong  orange
red     apple   |  round   apple
orange  orange  |  round   orange
red     apple   |  oblong  apple
red     apple   |  round   apple
As you can see, the best split is color, because for this dataset if color = red then fruit = apple, and if color = orange then fruit = orange.
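As a rough illustration of why color wins, here is a sketch that scores both candidate splits on the five points by the weighted Gini impurity of the resulting groups; the split_score helper is mine, just for this example:
from collections import Counter

def gini(labels):
    #Gini impurity: 1 minus the sum of squared class proportions
    counts = Counter(labels)
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_score(feature_values, labels):
    #weighted average Gini impurity of the groups created by splitting on one feature
    n = len(labels)
    score = 0.0
    for value in set(feature_values):
        group = [lab for f, lab in zip(feature_values, labels) if f == value]
        score += len(group) / n * gini(group)
    return score

colors = ["orange", "red", "orange", "red", "red"]
shapes = ["oblong", "round", "round", "oblong", "round"]
fruits = ["orange", "apple", "orange", "apple", "apple"]

print(split_score(colors, fruits))  # 0.0   -> splitting on color leaves pure groups
print(split_score(shapes, fruits))  # ~0.47 -> splitting on shape leaves mixed groups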
Training on these data points you would have the decision tree:
            color
     ___________________
    |                   |
    |                   |
   red                orange
  apple               orange
In real life these splits are often based on numerical thresholds, e.g. num > 0.52.
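For instance, if you trained on a numeric feature, you could print the threshold the tree chose with sklearn.tree.export_text (available in recent scikit-learn versions); the diameter values below are made up purely for illustration:
from sklearn.tree import DecisionTreeClassifier, export_text

#hypothetical numeric feature: fruit diameter in cm
diameters = [[7.5], [8.0], [7.2], [8.3], [8.1]]
fruit = ["orange", "apple", "orange", "apple", "apple"]

clf = DecisionTreeClassifier().fit(diameters, fruit)
#prints a rule like "diameter <= 7.75" with the learned threshold
print(export_text(clf, feature_names=["diameter"]))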
As for which algorithm to use, it depends. You'll have to experiment with your own data, since it's more of a per-dataset/preference kind of thing.
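If it helps, one common way to compare candidates on your own data is cross-validation. Here is a rough sketch using cross_val_score; the data are placeholders and the second classifier is just an arbitrary alternative for comparison:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

#placeholder encoded data: swap in your own feature matrix and labels
X = [[1, 1], [0, 0], [1, 0], [0, 1], [0, 0], [1, 1], [0, 1], [1, 0]]
y = [1, 0, 1, 0, 0, 1, 0, 1]

for model in (DecisionTreeClassifier(), KNeighborsClassifier(n_neighbors=3)):
    scores = cross_val_score(model, X, y, cv=2)
    print(type(model).__name__, scores.mean())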
You could use scikit-learn on the example above like this:
from sklearn.tree import DecisionTreeClassifier
#make your sample matrix (columns: color, shape)
samples = [[1, 1], [0, 0], [1, 0], [0, 1], [0, 0]]
#make your target vector (in this case the fruit)
fruitname = [1, 0, 1, 0, 0]
#create and fit the model
dtree = DecisionTreeClassifier()
dtree = dtree.fit(samples, fruitname)
#test an unknown red fruit that is oblong (predict expects a 2D array)
print(dtree.predict([[0, 1]]))  # -> [0], i.e. apple
Note that color=1 means that the fruit is orange, and shape=1 means that the fruit is oblong.
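If you'd rather not hand-code that 0/1 mapping, one option (assuming a reasonably recent scikit-learn) is sklearn.preprocessing.OrdinalEncoder, which learns an integer code per column; note that the integers it assigns may not match the convention above:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

raw = [["orange", "oblong"],
       ["red", "round"],
       ["orange", "round"],
       ["red", "oblong"],
       ["red", "round"]]
fruit = ["orange", "apple", "orange", "apple", "apple"]

encoder = OrdinalEncoder()  #assigns an integer to each category, per column
X = encoder.fit_transform(raw)
dtree = DecisionTreeClassifier().fit(X, fruit)

#an unknown red, oblong fruit
print(dtree.predict(encoder.transform([["red", "oblong"]])))  # -> ['apple']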
Have a look through the scikit-learn user guide for a more in-depth overview.