I have been working on a tool that tries to identify the author of a column, using my own data set.
I'm planning to use the mlpy Python library. It has good documentation (about 100 pages of PDF). I'm also open to other library suggestions.
The thing is, I'm lost in data mining and machine learning concepts. There is too much material: too many algorithms and too many concepts.
I'm asking for directions: which algorithms and concepts should I learn and search for, given my specific problem?
So far, I've built a dataset that looks something like this:
| author | feature x | feature y | feature z | some more features |
|--------+-----------+-----------+-----------+--------------------|
| A | 2 | 4 | 6 | .. |
| A | 1 | 1 | 5 | .. |
| B | 12 | 15 | 9 | .. |
| B | 13 | 13 | 13 | .. |
Now, when I get a new column, I'll parse it to extract all of its features. My aim is to figure out who the author of that column is.
As I'm not an ML guy, I can only think of computing the distance between the new column's features and those of every row, then picking the closest one. But I'm pretty sure this is not the way I should go.
I'd appreciate any directions, links, readings, etc.
If you have enough training data, you can use a kNN (k-nearest neighbor) classifier for your purpose. It is easy to understand, yet powerful.
Check scikits.ann for a possible implementation.
This tutorial here serves as a good reference for the one in scikit-learn.
Edit: In addition, here is the scikit-learn page for kNN. You can understand it easily from the given example.
mlpy also seems to have kNN.
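To make the idea concrete, here is a minimal pure-Python sketch of kNN rather than any particular library's implementation. It uses the four rows from the table in the question and assumes only the three named features; the function name `knn_predict`, the value `k=3`, and the `new_column` feature values are made up for illustration.

```python
import math
from collections import Counter

# Rows from the example table in the question: (author, [x, y, z]).
# Only the three named features are used; the rest are unknown here.
training = [
    ("A", [2, 4, 6]),
    ("A", [1, 1, 5]),
    ("B", [12, 15, 9]),
    ("B", [13, 13, 13]),
]


def knn_predict(rows, query, k=3):
    """Return the majority author among the k rows closest to query."""
    # Sort the training rows by Euclidean distance to the query vector.
    by_distance = sorted(rows, key=lambda row: math.dist(row[1], query))
    # Let the k nearest rows vote on the author.
    votes = Counter(author for author, _ in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical feature vector parsed from a new, unlabeled column.
new_column = [2, 3, 5]
predicted = knn_predict(training, new_column, k=3)
print(predicted)
```

With `k=1` this reduces to "pick the closest row", which is the approach described in the question; a larger `k` makes the vote less sensitive to a single unusual row. If the features have very different ranges, scaling them first is usually worthwhile, since the distance is otherwise dominated by the largest one.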
You have a wide selection of algorithms implemented in mlpy, so you should be fine. I agree with Steve L that Support Vector Machines are great, but even though they are easy to use, their inner details are not easy to grasp, especially if you are new to ML.
In addition to kNN, you could consider classification trees (http://en.wikipedia.org/wiki/Decision_tree_learning) and logistic regression (http://en.wikipedia.org/wiki/Logistic_regression).
For starters, decision trees have the advantage of producing output that is easy to understand and hence easier to debug.
Logistic regression, on the other hand, can give you good results and scales very well if you need more data.
I would say that in your case you should look for the algorithm that, after a bit of reading, you feel most comfortable working with. Most of the time, all of them are capable of giving you very decent results. Good luck!
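As a rough illustration of both suggestions, the sketch below fits a decision tree and a logistic regression with scikit-learn (used here only because it was mentioned in another answer; mlpy would work too). It assumes scikit-learn is installed, uses the four rows from the question's table with only the three named features, and the `new_column` values are hypothetical.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

# Rows from the example table in the question (three named features only).
X = [[2, 4, 6], [1, 1, 5], [12, 15, 9], [13, 13, 13]]
y = ["A", "A", "B", "B"]
feature_names = ["feature x", "feature y", "feature z"]

# Decision tree: the learned rules can be printed and read directly.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(tree, feature_names=feature_names))

# Logistic regression: a linear model that also yields class probabilities.
logreg = LogisticRegression(max_iter=1000).fit(X, y)

# Hypothetical feature vector parsed from a new, unlabeled column.
new_column = [[2, 3, 5]]
tree_prediction = tree.predict(new_column)[0]
logreg_prediction = logreg.predict(new_column)[0]

print("tree:", tree_prediction)
print("logistic regression:", logreg_prediction)
print("class order:", logreg.classes_)
print("probabilities:", logreg.predict_proba(new_column)[0])
```

Four rows are far too few to judge either model; with a real data set you would hold out some columns (or use cross-validation) and compare the accuracy of each candidate before settling on one.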