
How to perform undersampling in scikit-learn?

We have a retinal dataset in which diseased-eye samples make up 70 percent of the data and non-diseased-eye samples make up the remaining 30 percent. We want a dataset in which the diseased and non-diseased samples are equal in number. Is there a function that can do this?

Gaurav Patil asked Mar 23 '15 05:03


2 Answers

I would do this with a pandas DataFrame and numpy.random.choice, which make it easy to sample at random and produce equally sized datasets. An example:

import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randn(7, 4))
data['Healthy'] = [1, 1, 0, 0, 1, 1, 1]

This data has two non-healthy and five healthy samples. To randomly pick two samples from the healthy population you do:

healthy_indices = data[data.Healthy == 1].index
random_indices = np.random.choice(healthy_indices, 2, replace=False)
healthy_sample = data.loc[random_indices]

To automatically pick a subsample of the same size as the non-healthy group you can do:

sample_size = sum(data.Healthy == 0)  # Equivalent to len(data[data.Healthy == 0])
random_indices = np.random.choice(healthy_indices, sample_size, replace=False)
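To end up with a single balanced dataset, the subsampled healthy rows still have to be combined with the non-healthy rows. A minimal sketch of that last step, assuming the same toy data as above; the seeded generator `rng` and the names `unhealthy`, `balanced` and `counts` are introduced here for illustration only:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Same toy layout as above: five healthy rows (1) and two non-healthy rows (0).
data = pd.DataFrame(rng.standard_normal((7, 4)))
data['Healthy'] = [1, 1, 0, 0, 1, 1, 1]

healthy_indices = data[data.Healthy == 1].index
unhealthy = data[data.Healthy == 0]

# Draw as many healthy rows as there are non-healthy rows, without replacement.
random_indices = rng.choice(healthy_indices, size=len(unhealthy), replace=False)

# Stack both groups, then shuffle the rows so the classes are interleaved.
balanced = pd.concat([data.loc[random_indices], unhealthy]).sample(frac=1, random_state=0)

counts = balanced['Healthy'].value_counts()
print(counts)
```

Each class should now appear the same number of times in `balanced`. The final shuffle is optional, but it avoids handing a model all rows of one class before the other.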
RickardSjogren answered Sep 23 '22 11:09


You can use np.random.choice for naive undersampling as suggested previously, but one issue is that some of your random samples may be very similar to each other and thus misrepresent the dataset.

A better option is to use the imbalanced-learn package, which offers multiple methods for balancing a dataset. A good tutorial and description of these can be found here.

The package lists a few good options for undersampling (from their GitHub page):

  • Random majority under-sampling with replacement
  • Extraction of majority-minority Tomek links
  • Under-sampling with Cluster Centroids
  • NearMiss-(1 & 2 & 3)
  • Condensed Nearest Neighbour
  • One-Sided Selection
  • Neighbourhood Cleaning Rule
  • Edited Nearest Neighbours
  • Instance Hardness Threshold
  • Repeated Edited Nearest Neighbours
  • AllKNN
ege answered Sep 22 '22 11:09