
Use SMOTE to oversample image data

I'm doing binary classification with CNNs, and the data is imbalanced: the ratio of positive to negative medical images is 0.4 : 0.6. I want to use SMOTE to oversample the positive images before training. However, the data is 4D, with shape (761, 64, 64, 3), which causes the error

Found array with dim 4. Estimator expected <= 2

So, I reshape my train_data:

X_res, y_res = smote.fit_sample(X_train.reshape(X_train.shape[0], -1), y_train.ravel())

And it works fine. Before feeding it to the CNN, I reshape it back with:

X_res = X_res.reshape(X_res.shape[0], 64, 64, 3)

Now, I'm not sure whether this is a correct way to oversample. Will the reshape operation change the images' structure?

Salmon asked Dec 07 '18 09:12


3 Answers

I had a similar issue. I used the reshape function to flatten each image:

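from imblearn.over_sampling import SMOTE

print(X_train.shape)  # (8000, 250, 250, 3)

# Flatten each image into one row of 250 * 250 * 3 = 187500 values
ReX_train = X_train.reshape(8000, 250 * 250 * 3)
print(ReX_train.shape)  # (8000, 187500)

smt = SMOTE()
# fit_resample is the current name; older imbalanced-learn releases called it fit_sample
Xs_train, ys_train = smt.fit_resample(ReX_train, y_train)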

Although this approach is painfully slow, it helped improve performance.
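On the question of whether flattening and reshaping back changes the images' structure, a small check may help. This is a minimal sketch, assuming NumPy's default (C) memory order for both reshapes; `images` is a hypothetical random stand-in for `X_train`, not real data.

```python
import numpy as np

# Hypothetical stand-in for X_train: 5 random "images" of shape 64x64x3.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 64, 64, 3), dtype=np.uint8)

# Flatten each image into one row, as 2D-only estimators require.
flat = images.reshape(images.shape[0], -1)

# Reshape back to (n, height, width, channels).
restored = flat.reshape(-1, 64, 64, 3)

# With the same memory order in both calls, the round trip is exact.
round_trip_ok = np.array_equal(images, restored)
print(flat.shape, round_trip_ok)
```

This only shows that the original rows survive the round trip unchanged. Whether the synthetic rows that SMOTE appends are useful as images is a separate matter.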

Aditya Bhattacharya answered Oct 18 '22 20:10


  1. As soon as you flatten an image you lose localized (spatial) information. Preserving it is one of the reasons convolutions are used in image-based machine learning.
  2. 8000x250x250x3 has an inherent meaning: 8000 sample images, each 250 wide and 250 high, with 3 channels. When you reshape to 8000 x (250*250*3), each row is just a bunch of numbers, and unless you use some kind of sequence network to learn from it, that is bad.
  3. Oversampling is bad for image data; you can use image augmentation instead (20crop, introducing noise such as a Gaussian blur, rotations, translations, etc.).
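The augmentation idea in point 3 could be sketched roughly as follows. This is an illustrative NumPy-only example, not code from the answer: the helpers `augment` and `oversample_minority` are hypothetical names, the noise level is arbitrary, and it assumes square images (so 90-degree rotations keep the shape) and a binary label array.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly flipped, rotated, and noised copy of one HxWxC image."""
    out = image.astype(np.float32)
    if rng.random() < 0.5:
        out = out[:, ::-1, :]  # horizontal flip
    # Rotate by 0, 90, 180, or 270 degrees (shape-preserving only for square images).
    out = np.rot90(out, k=int(rng.integers(0, 4)), axes=(0, 1))
    out = out + rng.normal(0.0, 5.0, size=out.shape)  # mild Gaussian pixel noise
    return np.clip(out, 0, 255).astype(np.uint8)

def oversample_minority(X, y, minority_label, rng):
    """Append augmented minority images until both classes have equal counts."""
    minority_idx = np.flatnonzero(y == minority_label)
    n_needed = int((y != minority_label).sum()) - len(minority_idx)
    if n_needed <= 0:
        return X, y
    picks = rng.choice(minority_idx, size=n_needed, replace=True)
    extra = np.stack([augment(X[i], rng) for i in picks])
    X_out = np.concatenate([X, extra])
    y_out = np.concatenate([y, np.full(n_needed, minority_label, dtype=y.dtype)])
    return X_out, y_out

# Toy data with the question's 0.4 : 0.6 ratio (random pixels, not real images).
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(10, 64, 64, 3), dtype=np.uint8)
y = np.array([1] * 4 + [0] * 6)
X_bal, y_bal = oversample_minority(X, y, minority_label=1, rng=rng)
print(X_bal.shape, np.bincount(y_bal))
```

Which transforms are safe depends on the data. For medical images in particular, it is worth checking that a flip or rotation does not change what the label means before using it.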
cerofrais answered Oct 18 '22 22:10


  • First, flatten each image into a 1D vector.
  • Apply SMOTE to the flattened image data and its labels.
  • Reshape the flattened, resampled data back into RGB images.

from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)

# Flatten each image: (80, 100, 100, 3) -> (80, 30000)
train_rows = len(X_train)
X_train = X_train.reshape(train_rows, -1)

# Oversample the minority class; the number of rows grows beyond 80
X_train, y_train = sm.fit_resample(X_train, y_train)

# Restore the image shape: (>80, 100, 100, 3)
X_train = X_train.reshape(-1, 100, 100, 3)
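To see what the rows SMOTE adds look like once they are reshaped, here is a simplified NumPy sketch of its interpolation step. It is not imbalanced-learn's implementation: the real algorithm picks the neighbor from the k nearest minority-class samples, whereas `x_i` and `x_neighbor` here are random stand-ins chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two hypothetical flattened minority-class images (100 * 100 * 3 = 30000 values each).
x_i = rng.random(30000)
x_neighbor = rng.random(30000)

# SMOTE's core step: pick a random point on the segment between a sample and a neighbor.
lam = rng.random()                      # interpolation weight in [0, 1)
x_new = x_i + lam * (x_neighbor - x_i)  # pixel-wise blend of the two images

# Reshaped back, the synthetic sample is a cross-fade of the two source images.
synthetic_image = x_new.reshape(100, 100, 3)
print(synthetic_image.shape)
```

In other words, each synthetic image is a pixel-wise weighted average of two existing minority images, which is worth keeping in mind when judging whether the results are plausible training samples.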

Hemanth Kollipara answered Oct 18 '22 20:10