
Machine learning model suggestion for large imbalanced data

I have a data set for a classification problem, with 50 classes in total.

 Class1: 10,000 examples 
 Class2: 10 examples
 Class3: 5 examples 
 Class4: 35 examples
 .
 .
 . 
and so on.

I tried to train my classifier using an SVM (both linear and Gaussian kernels). My accuracy on the test data is very poor: 65% and 72% respectively. I am now thinking of moving to a neural network. Can you suggest a machine learning model or algorithm for large, imbalanced data? Any help would be much appreciated.


2 Answers

You should provide more information about the data set's features and the class distribution; this would help others advise you. In any case, I don't think a neural network fits here, as this data set is too small for it.

Assuming 50% or more of the samples are of class 1, I would first look for a classifier that differentiates between class 1 and non-class 1 samples (binary classification). This classifier should outperform a naive classifier (benchmark) that randomly chooses a label with a prior matching the training set class distribution. For example, assuming there are 1,000 samples, of which 700 are of class 1, the benchmark classifier would classify a new sample as class 1 with probability 700/1,000 = 0.7 (like an unfair coin toss).
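To make the benchmark concrete, here is a minimal sketch of a prior-sampling classifier using the 700/300 example above. The function names (`fit_prior`, `predict_from_prior`) and the label strings are made up for illustration, and the expected-accuracy figure assumes the test data follows the same class distribution as the training data.

```python
import random
from collections import Counter


def fit_prior(labels):
    """Estimate class probabilities from the training labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def predict_from_prior(prior, n, seed=0):
    """Draw n random predictions, ignoring the features entirely."""
    rng = random.Random(seed)
    classes = list(prior)
    weights = [prior[c] for c in classes]
    return rng.choices(classes, weights=weights, k=n)


# Hypothetical training labels: 700 of class 1, 300 of anything else.
train_labels = ["class1"] * 700 + ["other"] * 300
prior = fit_prior(train_labels)
predictions = predict_from_prior(prior, n=1000)

# If the test set has the same distribution, the benchmark is right
# with probability sum(p_c ** 2) = 0.7**2 + 0.3**2 = 0.58.
expected_accuracy = sum(p * p for p in prior.values())
print(prior, round(expected_accuracy, 2))
```

Any real binary classifier for class 1 versus the rest should beat this figure comfortably before you move on.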

Once you have found a classifier with good accuracy, the next phase can be classifying the samples predicted as non-class 1 into one of the other 49 classes. Assuming these classes are more balanced, I would start with RF (random forest), NB (naive Bayes) and KNN (k-nearest neighbors).
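The two phases can be chained as in the sketch below. It only shows the control flow: `is_class1` and `classify_rest` are hypothetical stand-ins for whatever trained models you end up with (for example a binary SVM and a random forest), and the threshold rules inside them are toy placeholders, not a recommendation.

```python
def two_stage_predict(x, is_class1, classify_rest):
    """Stage 1 decides class 1 vs. the rest; stage 2 handles the rest."""
    if is_class1(x):
        return "class1"
    return classify_rest(x)


# Toy stand-ins for trained models (assumptions for illustration only).
def is_class1(x):
    return x < 5.0


def classify_rest(x):
    return "class2" if x < 10.0 else "class3"


samples = [1.2, 7.5, 42.0]
labels = [two_stage_predict(x, is_class1, classify_rest) for x in samples]
print(labels)
```

Note that the second-stage model would be trained only on samples that do not belong to class 1, so errors made by the first stage cannot be corrected later.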

Eyal Shulman answered Sep 13 '25 09:09



There are multiple ways to handle imbalanced datasets. You can try:

  1. Upsampling
  2. Downsampling
  3. Class weights

I would suggest either upsampling or providing class weights to balance it.

https://towardsdatascience.com/5-techniques-to-work-with-imbalanced-data-in-machine-learning-80836d45d30c
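As a rough sketch of the two suggested options, the snippet below computes class weights and performs random upsampling for the four class counts given in the question (the other 46 classes are not listed, so they are left out). The weight formula `n_samples / (n_classes * count)` is the common "balanced" heuristic; the function names and the use of integer indices as stand-in samples are assumptions for illustration.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


def random_upsample(samples, labels, seed=0):
    """Resample minority classes with replacement up to the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        extra = rng.choices(group, k=target - len(group))
        out_samples.extend(group + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Class counts taken from the question; samples are placeholder indices.
labels = (["Class1"] * 10000 + ["Class2"] * 10
          + ["Class3"] * 5 + ["Class4"] * 35)
samples = list(range(len(labels)))

weights = balanced_class_weights(labels)
up_samples, up_labels = random_upsample(samples, labels)
print(weights)
print(Counter(up_labels))
```

Upsampling should be applied to the training split only; duplicating samples before the train/test split leaks copies of training points into the test set.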

You should also think about your performance metric. Don't use accuracy as your performance metric; you can use log loss or any other suitable metric.

https://machinelearningmastery.com/failure-of-accuracy-for-imbalanced-class-distributions/
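To illustrate why accuracy misleads here, the sketch below scores a trivial model that always predicts Class1, again using only the four class counts listed in the question. Macro-averaged recall is my own choice of "other suitable metric" for the comparison; it is not named in the answer above.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_recall(y_true, y_pred):
    """Average of per-class recall, so every class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


y_true = (["Class1"] * 10000 + ["Class2"] * 10
          + ["Class3"] * 5 + ["Class4"] * 35)
y_pred = ["Class1"] * len(y_true)  # always predict the majority class

acc = accuracy(y_true, y_pred)
rec = macro_recall(y_true, y_pred)
print(f"accuracy={acc:.4f} macro_recall={rec:.2f}")
```

The always-Class1 model scores above 99% accuracy while recognising none of the minority classes, which the macro recall of 0.25 makes visible.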

Aashutosh sinha answered Sep 13 '25 09:09
