 

Major assumptions of machine learning classifiers (LR, SVM, and decision trees)

In classical statistics, people usually state the assumptions being made (e.g. normality and linearity of the data, independence of observations). But when I read machine learning textbooks and tutorials, the underlying assumptions are not always explicitly or completely stated. What are the major assumptions of the following ML classifiers for binary classification, and which ones are not so important to uphold and which must be upheld strictly?

  • Logistic regression
  • Support vector machine (linear and non-linear kernel)
  • Decision trees
KubiK888 asked Feb 16 '16


People also ask

What are the assumptions of SVM?

Thus, SVMs can be defined as linear classifiers under the following two assumptions: the margin should be as large as possible, and the support vectors are the most informative data points because they lie closest to the decision boundary and are therefore the ones most likely to be misclassified.
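
A minimal sketch of that idea, assuming scikit-learn is available (the toy dataset and parameters are purely illustrative, not part of the original answer): fit a linear SVM and inspect which points end up as support vectors.

```python
# Minimal sketch (assumes scikit-learn): fit a linear SVM and inspect the
# support vectors, i.e. the few points closest to the margin.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=0)  # toy 2-class data

clf = SVC(kernel="linear", C=1.0)  # large-margin linear classifier
clf.fit(X, y)

# Only a handful of points (the support vectors) define the separating hyperplane;
# the remaining points could be removed without changing the decision boundary.
print("number of support vectors per class:", clf.n_support_)
print("support vectors:\n", clf.support_vectors_)
```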

What are the assumptions of machine learning?

It assumes that there is minimal or no multicollinearity among the independent variables. It usually requires a large sample size to predict properly. It assumes the observations to be independent of each other.
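
The multicollinearity assumption can be checked directly. A minimal sketch, assuming pandas and statsmodels are available (the synthetic predictors and their names are made up for illustration):

```python
# Minimal sketch (assumes pandas/statsmodels): check the "little multicollinearity"
# assumption by computing a variance inflation factor (VIF) per predictor.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)   # deliberately correlated with x1
x3 = rng.normal(size=200)                    # independent predictor
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# A VIF far above ~10 for x1/x2 flags the collinearity; x3 stays near 1.
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))
```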

What is the difference between SVM and decision tree?

SVM uses the kernel trick to solve non-linear problems, whereas decision trees derive hyper-rectangles in input space. Decision trees are better suited to categorical data and handle collinearity better than SVM.
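
A small sketch of that contrast, assuming scikit-learn (the moons dataset and depth limit are arbitrary choices): the same non-linear problem is fit with an RBF-kernel SVM and with a shallow decision tree, whose learned rules are axis-aligned thresholds, i.e. hyper-rectangles.

```python
# Minimal sketch (assumes scikit-learn): kernel SVM vs. axis-aligned tree splits.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

svm = SVC(kernel="rbf").fit(X_tr, y_tr)                 # kernel trick handles the curvature
tree = DecisionTreeClassifier(max_depth=3).fit(X_tr, y_tr)

print("SVM accuracy: ", svm.score(X_te, y_te))
print("Tree accuracy:", tree.score(X_te, y_te))
print(export_text(tree, feature_names=["x0", "x1"]))    # threshold rules = rectangles
```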

Does SVM use decision tree?

In machine learning, tree-based techniques and Support Vector Machines (SVM) are popular tools for building prediction models. Both can be intuitively understood as ways of separating data points into different groups (labels), although they rest on different theories.


2 Answers

IID (independent and identically distributed) data is the fundamental assumption of almost all statistical learning methods.

Logistic regression is a special case of the GLM (generalized linear model). So, beyond some technical requirements, the strictest restriction lies in the assumed distribution of the response: it must follow a distribution in the exponential family. You can dig deeper at https://en.wikipedia.org/wiki/Generalized_linear_model, and the Stanford CS229 lecture notes 1 also have excellent coverage of this topic.
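
For reference, a short sketch of that exponential-family claim for the Bernoulli case relevant to binary classification (notation follows the usual GLM convention, not anything specific in the answer):

```latex
% GLM assumption: the response distribution is in the exponential family
p(y;\eta) = b(y)\,\exp\!\big(\eta\, T(y) - a(\eta)\big)

% Bernoulli case (binary labels), with \phi = P(y = 1):
p(y;\phi) = \phi^{y}(1-\phi)^{1-y}
          = \exp\!\Big(y \log\tfrac{\phi}{1-\phi} + \log(1-\phi)\Big)

% so the natural parameter is the log-odds \eta = \log\tfrac{\phi}{1-\phi}, and
% inverting it gives the sigmoid \phi = \frac{1}{1+e^{-\eta}} of logistic regression.
```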

SVM is quite tolerant of input data, especially the soft-margin version. I cannot remember any specific distributional assumption being made about the data (please correct me if I am wrong).

Decision trees tell the same story as SVMs: no particular distributional assumptions about the input data.
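
One practical consequence worth showing: tree splits depend only on the ordering of feature values, so a strictly monotone transform of a feature typically leaves the fitted tree's predictions unchanged. A minimal sketch, assuming scikit-learn (the dataset and the exp transform are arbitrary illustrations):

```python
# Minimal sketch (assumes scikit-learn): trees use only feature orderings, so a
# strictly increasing transform of every feature should not change predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_warped = np.exp(X)  # strictly increasing transform of every feature

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_warped = DecisionTreeClassifier(random_state=0).fit(X_warped, y)

same = np.array_equal(tree_raw.predict(X), tree_warped.predict(X_warped))
print("identical predictions after monotone transform:", same)  # expected: True
```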

RogerTR answered Oct 24 '22


Great question.

Logistic Regression also assumes the following:

  1. That there isn't (or there is little) multicollinearity (high correlation) among the independent variables.

  2. Even though LR doesn't require the dependent and independent variables to be linearly related, it does require the independent variables to be linearly related to the log odds. The log-odds function is simply log(p/(1-p)); a sketch follows this list.
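
A minimal sketch of the log-odds point, assuming scikit-learn and numpy (the synthetic dataset is purely illustrative): the log-odds of the fitted probabilities coincide with the linear score Xβ + intercept, up to numerical precision.

```python
# Minimal sketch (assumes scikit-learn/numpy): for a fitted logistic regression,
# log(p/(1-p)) of the predicted probabilities equals the linear score X @ coef + b.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, class_sep=0.5, random_state=0)
clf = LogisticRegression().fit(X, y)

p = clf.predict_proba(X)[:, 1]
log_odds = np.log(p / (1 - p))
linear_score = X @ clf.coef_.ravel() + clf.intercept_[0]

print("max difference:", np.max(np.abs(log_odds - linear_score)))  # ~1e-12
```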

msarafzadeh answered Oct 24 '22