 

Python - machine learning

I am currently trying to understand how machine learning algorithms work, and one thing I don't really get is the apparent discrepancy between the computed accuracy of the predicted labels and what the confusion matrix shows. I will try to explain as clearly as possible.

Here is a snippet of the dataset. It shows 9 samples (the real dataset has about 4k), 6 features and 9 labels. The labels are not quantities but stand for categories, so they cannot be ordered like 7 > 4 > 1:

f1      f2      f3      f4      f5    f6   label
89.18   0.412   9.1     24.17   2.4   1    1
90.1    0.519   14.3    16.555  3.2   1    2
83.42   0.537   13.3    14.93   3.4   1    3
64.82   0.68    9.1     8.97    4.5   2    4
34.53   0.703   4.9     8.22    3.5   2    5
87.19   1.045   4.7     5.32    5.4   2    6
43.23   0.699   14.9    12.375  4.0   2    7
43.29   0.702   7.3     6.705   4.0   2    8
20.498  1.505   1.321   6.4785  3.8   2    9

Out of curiosity I tried a number of algorithms (Linear, Gaussian, SVM (SVC, SVR), Bayesian etc.). As far as I understood the manual, in my case it is better to work with classifiers (discrete output) than with regressors (continuous output). Using the usual:

model.fit(X_train, y_train) 
model.score(X_test, y_test)

I got:

Lin_Reg: 0.855793988736
Log_Reg: 0.463251670379
DTC:     0.400890868597
KNC:     0.41425389755
LDA:     0.550111358575
Gaus_NB: 0.391982182628
Bay_Rid: 0.855698151574
SVC:     0.483296213808
SVR:     0.647914795849

The continuous (regression) algorithms scored better. When I built a confusion matrix for Bayesian Ridge to verify its result (I had to convert its float predictions to integers), I got the following:

Pred  l1   l2   l3   l4   l5   l6   l7   l8   l9
True
l1    23   66   0    0    0    0    0    0    0
l2    31   57   1    0    0    0    0    0    0
l3    13   85   19   0    0    0    0    0    0
l4    0    0    0    0    1    6    0    0    0
l5    0    0    0    4    8    7    0    0    0
l6    0    0    0    1    27   36   7    0    0
l7    0    0    0    0    2    15   0    0    0
l8    0    0    0    1    1    30   8    0    0
l9    0    0    0    1    0    9    1    0    0

This made me think that the 85% accuracy is wrong. How can this be explained? Is it because of the float/int conversion?

I would be thankful for any direct answer, link, etc.

Moveton asked Oct 21 '16 13:10




2 Answers

You are mixing two very distinct concepts of machine learning here: regression and classification. Regression typically deals with continuous values, e.g. temperature or stock market value. Classification, on the other hand, can tell you which bird species is in a recording, and that is exactly where you would use a confusion matrix: it tells you how many times the algorithm predicted each label correctly and where it made mistakes. scikit-learn, which you appear to be using, has separate sections for both.
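To make the confusion-matrix reading concrete, here is a minimal sketch that takes the Bayesian Ridge matrix exactly as posted in the question and computes the share of samples that land on the diagonal, i.e. the accuracy. The variable names are my own, and I am assuming rows are true labels and columns are predicted labels, as the header in the question suggests.

```python
# Confusion matrix copied from the question.
# Assumption: rows are true labels l1..l9, columns are predicted labels l1..l9.
confusion = [
    [23, 66, 0, 0, 0, 0, 0, 0, 0],
    [31, 57, 1, 0, 0, 0, 0, 0, 0],
    [13, 85, 19, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 6, 0, 0, 0],
    [0, 0, 0, 4, 8, 7, 0, 0, 0],
    [0, 0, 0, 1, 27, 36, 7, 0, 0],
    [0, 0, 0, 0, 2, 15, 0, 0, 0],
    [0, 0, 0, 1, 1, 30, 8, 0, 0],
    [0, 0, 0, 1, 0, 9, 1, 0, 0],
]

# Correct predictions sit on the diagonal (true label == predicted label).
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total

print(f"{correct} of {total} correct, accuracy = {accuracy:.3f}")
# prints: 143 of 460 correct, accuracy = 0.311
```

So the matrix itself corresponds to an accuracy of roughly 31%, far below the 0.855 reported by score() for the same model.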

You can use different metrics for scoring classification and regression problems, so never assume the scores are comparable. As @javad pointed out, the 'coefficient of determination' is very different from accuracy. I would also recommend reading up on precision and recall.
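As a rough illustration of precision and recall, the sketch below derives both per label from the confusion matrix posted in the question. It again assumes rows are true labels and columns are predicted labels. Reporting "n/a" when a label is never predicted is my own choice for the example, not a library convention.

```python
# Confusion matrix copied from the question.
# Assumption: rows are true labels l1..l9, columns are predicted labels l1..l9.
confusion = [
    [23, 66, 0, 0, 0, 0, 0, 0, 0],
    [31, 57, 1, 0, 0, 0, 0, 0, 0],
    [13, 85, 19, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 6, 0, 0, 0],
    [0, 0, 0, 4, 8, 7, 0, 0, 0],
    [0, 0, 0, 1, 27, 36, 7, 0, 0],
    [0, 0, 0, 0, 2, 15, 0, 0, 0],
    [0, 0, 0, 1, 1, 30, 8, 0, 0],
    [0, 0, 0, 1, 0, 9, 1, 0, 0],
]

n = len(confusion)
precisions = []
recalls = []
for k in range(n):
    true_positive = confusion[k][k]
    predicted_as_k = sum(confusion[i][k] for i in range(n))  # column sum
    actually_k = sum(confusion[k])                           # row sum
    # Precision: of everything predicted as k, how much really was k.
    precisions.append(true_positive / predicted_as_k if predicted_as_k else None)
    # Recall: of everything that really was k, how much was predicted as k.
    recalls.append(true_positive / actually_k if actually_k else None)


def fmt(value):
    """Format a ratio, or 'n/a' when it is undefined (division by zero)."""
    return "n/a" if value is None else f"{value:.2f}"


for k in range(n):
    print(f"l{k + 1}: precision = {fmt(precisions[k])}, recall = {fmt(recalls[k])}")
```

In this matrix l8 and l9 are never predicted at all, so their precision is undefined and their recall is zero, which a single accuracy figure would hide.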

In your case you clearly have a classification problem, and it should be treated as such. Also, note that f6 looks like it takes a discrete set of values.

If you'd like to quickly experiment with different approaches, I can recommend e.g. H2O, which, in addition to a nice API, has a great user interface and allows for massively parallel processing. XGBoost is also excellent.

Lukasz Tracewski answered Sep 25 '22 23:09


Take a look at the documentation here.

If you call score() on regression methods, it returns the 'coefficient of determination R^2 of the prediction', not the accuracy.
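To see how far apart the two numbers can be, here is a small sketch that computes R^2 = 1 - SS_res / SS_tot by hand and compares it with the exact-match accuracy of the same predictions after rounding. The data are made up purely for illustration and are not from the question's dataset. Using round() for the float-to-integer step is also an assumption, since the question does not say how the conversion was done.

```python
# Made-up data for illustration only (not from the question's dataset).
y_true = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y_pred = [1.6, 2.6, 3.6, 4.6, 5.6, 6.6, 7.6, 8.6, 9.4]

# Coefficient of determination: 1 - (residual sum of squares / total sum of squares).
mean_true = sum(y_true) / len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

# Accuracy: fraction of predictions that hit the label exactly after rounding.
# Assumption: floats are converted to labels with round().
rounded = [round(p) for p in y_pred]
accuracy = sum(t == r for t, r in zip(y_true, rounded)) / len(y_true)

print(f"R^2 = {r2:.3f}, accuracy after rounding = {accuracy:.3f}")
# prints: R^2 = 0.949, accuracy after rounding = 0.111
```

Predictions that are consistently close to the target give a high R^2 even when almost none of them match a label exactly, which is why a regression score of 0.855 and a poor confusion matrix can coexist.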

javad answered Sep 23 '22 23:09