
Which machine learning classifier to choose, in general? [closed]



[Figure: scikit-learn machine learning map - a flowchart for choosing an estimator]

First of all, you need to identify your problem. It depends upon what kind of data you have and what your desired task is.

If you are predicting a category:

  • You have labeled data
    • Follow a classification approach and its algorithms
  • You don't have labeled data
    • Go for a clustering approach

If you are predicting a quantity:

  • Go for a regression approach

Otherwise:

  • You can go for a dimensionality reduction approach

There are different algorithms within each approach mentioned above. The choice of a particular algorithm depends upon the size of the dataset.

Source: http://scikit-learn.org/stable/tutorial/machine_learning_map/
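
As a quick illustration, the flowchart above reduces to a few branches. This is just a sketch; the helper name and the returned labels are made up for illustration:

```python
# Hypothetical helper mirroring the decision flow above.
def suggest_approach(task, labeled=False):
    """task: 'category', 'quantity', or None (neither)."""
    if task == "category":
        # Labeled data -> classification; unlabeled -> clustering.
        return "classification" if labeled else "clustering"
    if task == "quantity":
        return "regression"
    return "dimensionality reduction"

print(suggest_approach("category", labeled=True))   # classification
```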


Model selection using cross validation may be what you need.

Cross validation

What you do is simply split your dataset into k non-overlapping subsets (folds), train a model using k-1 of the folds, and estimate its performance on the fold you left out. You repeat this k times, leaving each fold out exactly once (first the 1st fold, then the 2nd, ..., then the kth, training on the remaining folds each time). When you're done, you compute the mean performance across all folds (and perhaps also the variance/standard deviation of the performance).

How to choose the parameter k depends on the time you have. Usual values for k are 3, 5, 10 or even N, where N is the size of your data (that's the same as leave-one-out cross validation). I prefer 5 or 10.
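
The procedure can be sketched in plain Python. The "model" here is a toy majority-label predictor, just to make the loop runnable; any real model and metric would plug in through the `train_and_score` callback:

```python
# A plain-Python sketch of k-fold cross validation.

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k non-overlapping folds."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

def cross_validate(n, train_and_score, k=5):
    """Mean performance over k folds: train on k-1 folds, score on the one left out."""
    folds = k_fold_indices(n, k)
    scores = []
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f in range(k) if f != held_out for i in folds[f]]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / k

# Toy "model": always predict the majority label seen in the training folds.
labels = [0, 0, 0, 1, 1, 0, 0, 1, 1, 1]

def train_and_score(train_idx, test_idx):
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == majority for i in test_idx) / len(test_idx)

mean_score = cross_validate(len(labels), train_and_score, k=5)
```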

Model selection

Let's say you have 5 methods (ANN, SVM, KNN, etc) and 10 parameter combinations for each method (depending on the method). You simply have to run cross validation for each method and parameter combination (5 * 10 = 50) and select the best model, method and parameters. Then you re-train with the best method and parameters on all your data and you have your final model.
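
The selection loop itself is just "best score wins". In this sketch the method names, the parameter grid, and the scoring function are all made up; `cv_score` stands in for a real cross-validation run:

```python
import random
random.seed(0)

# Hypothetical methods and per-method parameter grids.
methods = ["ANN", "SVM", "KNN"]
param_grid = {m: [{"p": p} for p in range(1, 4)] for m in methods}

def cv_score(method, params):
    # Stand-in for real cross validation; returns a made-up score in [0, 1).
    return random.random()

# Evaluate every (method, params) combination and keep the best.
best = max(
    ((m, p, cv_score(m, p)) for m in methods for p in param_grid[m]),
    key=lambda t: t[2],
)
# Finally, re-train the winning method with the winning parameters on ALL the data.
```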

There are some more things to say. If, for example, you use a lot of methods and parameter combinations for each, it's very likely you will overfit. In cases like these, you have to use nested cross validation.

Nested cross validation

In nested cross validation, you perform cross validation on the model selection algorithm.

Again, you first split your data into k folds. For each of the k steps, you hold out one fold as test data and use the remaining k-1 folds as training data. You then run model selection (the procedure I explained above) on each of those training sets. After finishing, you will have k models, one for each held-out fold. You test each model on its held-out fold and choose the best one. As before, once you have picked the winning method and parameters, you train a new model with them on all the data you have. That's your final model.
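
The outer loop can be sketched like this. `select_model` stands in for the inner model-selection procedure (which itself runs cross validation on the training indices it receives); the lambda used here is a toy placeholder:

```python
def nested_cv(n, k, select_model):
    """Outer loop of nested cross validation: for each held-out fold, run
    model selection (inner CV) on the remaining k-1 folds, then pair the
    selected model with its held-out test fold for evaluation."""
    folds = [[i for i in range(n) if i % k == f] for f in range(k)]
    results = []
    for held_out in range(k):
        train_idx = [i for f in range(k) if f != held_out for i in folds[f]]
        model = select_model(train_idx)   # inner cross validation happens here
        results.append((model, folds[held_out]))
    return results

# Toy selector standing in for the inner model-selection procedure.
results = nested_cv(12, 3, select_model=lambda train_idx: "toy-model")
```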

Of course, there are many variations of these methods and other things I didn't mention. If you need more information about these look for some publications about these topics.


The book "OpenCV" has two great pages on this (pages 462-463). Searching the Amazon preview for the word "discriminative" (probably Google Books too) will let you see the pages in question. These two pages are the greatest gem I have found in this book.

In short:

  • Boosting - often effective when a large amount of training data is available.

  • Random trees - often very effective and can also perform regression.

  • K-nearest neighbors - simplest thing you can do, often effective but slow and requires lots of memory.

  • Neural networks - Slow to train but very fast to run, still optimal performer for letter recognition.

  • SVM - Among the best with limited data; it loses only to boosting or random trees when large data sets are available.


Things you might consider in choosing which algorithm to use would include:

  1. Do you need to train incrementally (as opposed to batched)?

    If you need to update your classifier with new data frequently (or you have tons of data), you'll probably want to use a Bayesian classifier. Neural nets and SVMs typically work on the training data in one go (though neural nets can also be trained incrementally with stochastic updates).

  2. Is your data composed of categorical only, or numeric only, or both?

    I think Bayesian classifiers work best with categorical/binomial data. Plain classification trees don't predict numerical values (though regression trees do).

  3. Do you or your audience need to understand how the classifier works?

    Use Bayesian or decision trees, since these can be easily explained to most people. Neural networks and SVM are "black boxes" in the sense that you can't really see how they are classifying data.

  4. How much classification speed do you need?

    SVMs are fast at classification time, since they only need to determine which side of the "line" your data is on. Decision trees can be slow, especially when they're complex (e.g. lots of branches).

  5. Complexity.

    Neural nets and SVMs can handle complex non-linear classification.
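
To illustrate the "which side of the line" point from item 4: once a linear SVM is trained, classifying a point is just a dot product and a sign check. The weights and bias below are made up, not learned:

```python
# Made-up weights/bias standing in for a trained linear SVM.
w = [2.0, -1.0]
b = -0.5

def classify(x):
    # Sign of the decision function: which side of the hyperplane x falls on.
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

print(classify([1.0, 0.5]))   # 2*1.0 - 1*0.5 - 0.5 = 1.0  -> +1
print(classify([0.0, 1.0]))   # 2*0.0 - 1*1.0 - 0.5 = -1.5 -> -1
```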


As Prof. Andrew Ng often says: always begin with a quick and dirty implementation of some algorithm, and then iteratively refine it.

For classification, Naive Bayes is a good starter: it performs well, is highly scalable, and can adapt to almost any kind of classification task. Also, 1NN (k-nearest neighbours with only 1 neighbour) is a no-hassle, best-fit algorithm (because the data is the model, so you don't have to worry about fitting your decision boundary's dimensionality); its only issue is the computation cost (quadratic, because you need to compute the distance matrix, so it may not be a good fit for high-dimensional data).
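
The "data is the model" point is easy to see in code: a 1NN classifier is nothing but a minimum-distance lookup over the training set (toy 2-D points here):

```python
import math

def one_nn(train, x):
    """1-nearest-neighbour: the training data IS the model.
    train is a list of (point, label) pairs; returns the label of the
    training point closest to x."""
    label, _ = min(
        ((lbl, math.dist(pt, x)) for pt, lbl in train),
        key=lambda t: t[1],
    )
    return label

train = [((0.0, 0.0), "a"), ((5.0, 5.0), "b")]
print(one_nn(train, (1.0, 1.0)))   # "a" (closest to the origin point)
```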

Another good starter is Random Forests (an ensemble of decision trees): it is highly scalable to any number of dimensions and generally performs quite acceptably. Finally, there are genetic algorithms, which scale admirably well to any dimension and any data with minimal knowledge of the data itself; the most minimal and simplest implementation is the microbial genetic algorithm (only one line of C code! by Inman Harvey in 1996), and among the most complex are CMA-ES and MOGA/e-MOEA.

And remember that, often, you can't really know what will work best on your data before you try the algorithms for real.

As a side note, if you want a theoretical framework to test your hypotheses and your algorithms' theoretical performance for a given problem, you can use the PAC (probably approximately correct) learning framework (beware: it's very abstract and complex!). To summarize, the gist of PAC learning says that you should use the least complex algorithm that is still complex enough (complexity being the maximum dimensionality the algorithm can fit) to fit your data. In other words, apply Occam's razor.


Sam Roweis used to say that you should try naive Bayes, logistic regression, k-nearest neighbour and Fisher's linear discriminant before anything else.


My take on it is that you always run the basic classifiers first to get some sense of your data. More often than not (in my experience at least) they've been good enough.

So, if you have supervised data, train a Naive Bayes classifier. If you have unsupervised data, you can try k-means clustering.
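
As a sketch of that first step, here is a minimal categorical naive Bayes in plain Python, with toy weather data and add-one smoothing; a real task would normally reach for a library implementation instead:

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Fit a categorical naive Bayes model: class priors plus per-feature
    value counts for each class."""
    priors = Counter(labels)
    counts = defaultdict(Counter)   # (feature_index, class) -> value counts
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            counts[(j, y)][v] += 1
    return priors, counts

def predict_nb(model, row):
    """Pick the class maximizing prior * product of smoothed likelihoods."""
    priors, counts = model
    n = sum(priors.values())
    best, best_p = None, -1.0
    for y, c in priors.items():
        p = c / n
        for j, v in enumerate(row):
            seen = counts[(j, y)]
            # Add-one smoothing so unseen values don't zero out the product.
            p *= (seen[v] + 1) / (sum(seen.values()) + 1 + len(seen))
        if p > best_p:
            best, best_p = y, p
    return best

# Toy training data: (outlook, temperature) -> go out or stay in.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cold")]
labels = ["out", "out", "in", "in"]
model = train_nb(rows, labels)
print(predict_nb(model, ("sunny", "hot")))   # "out"
```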

Another resource is the Stanford Machine Learning lecture video series, which I watched a while back. In video 4 or 5, I think, the lecturer discusses some generally accepted conventions for training classifiers, advantages/tradeoffs, etc.