Is there a built-in function in either Pandas or Scikit-learn for resampling according to a specified strategy? I want to resample my data based on a categorical variable.
For example, my data has 75% men and 25% women, but I'd like to train my model on 50% men and 50% women. (I'd also like to be able to generalize to cases that aren't 50/50.)
What I need is something that resamples my data according to specified proportions.
My stab at a function to do what I want is below. Hope this is helpful to someone else.
X and y are assumed to be a Pandas DataFrame and Series respectively.
import numpy as np
import pandas as pd


def resample(X, y, sample_type=None, sample_size=None, class_weights=None, seed=None):
    # Nothing to do if sample_type is 'abs' or not set. sample_size should then be int
    # If sample type is 'min' or 'max' then sample_size should be float
    # (a multiplier applied to the smallest or largest class count)
    if sample_type == 'min':
        sample_size_ = np.round(sample_size * y.value_counts().min()).astype(int)
    elif sample_type == 'max':
        sample_size_ = np.round(sample_size * y.value_counts().max()).astype(int)
    else:
        sample_size_ = max(int(sample_size), 1)

    if seed is not None:
        np.random.seed(seed)

    if class_weights is None:
        class_weights = dict()

    # Sample each class with replacement, scaled by its class weight
    samples = []
    for yi in y.unique():
        size = np.round(sample_size_ * class_weights.get(yi, 1.)).astype(int)
        X_yi = X[y == yi]
        sample_index = np.random.choice(X_yi.index, size=size)
        samples.append(X_yi.reindex(sample_index))

    # DataFrame.append was removed in pandas 2.0, so concatenate the pieces instead
    return pd.concat(samples)
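For comparison, the same idea can be sketched with pandas alone: split the rows by the categorical variable and draw a fixed number from each group with DataFrame.sample. This is only a minimal sketch, not a built-in. The helper name resample_to_proportions and its parameters are hypothetical, and it samples with replacement so that a small class can be scaled up past its original size.

```python
import pandas as pd


def resample_to_proportions(df, column, proportions, n_total, seed=None):
    """Hypothetical helper: draw about n_total rows so that the values in
    `column` follow `proportions` (a dict mapping value -> fraction)."""
    parts = []
    for value, fraction in proportions.items():
        group = df[df[column] == value]
        n_rows = int(round(n_total * fraction))
        # replace=True lets a small group be sampled beyond its original size
        parts.append(group.sample(n=n_rows, replace=True, random_state=seed))
    return pd.concat(parts)


# Toy data: 75% men, 25% women
df = pd.DataFrame({
    'gender': ['man'] * 75 + ['woman'] * 25,
    'value': range(100),
})

balanced = resample_to_proportions(
    df, 'gender', {'man': 0.5, 'woman': 0.5}, n_total=100, seed=0)
print(balanced['gender'].value_counts())
```

Passing different fractions, such as {'man': 0.3, 'woman': 0.7}, covers the cases that aren't 50/50.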
If you are open to importing a library, I find the imbalanced-learn library useful when addressing resampling. Here the categorical variable is the target 'y' and the data to re-sample on is 'X'. In the example below fish are resampled to equal the number of dogs, 3:3.
The code is slightly modified from the imbalanced-learn docs, section 2.1.1, Naive random over-sampling. You can use this method with numeric data and strings.
import numpy as np
from collections import Counter
from imblearn.over_sampling import RandomOverSampler

y = np.array([1, 1, 0, 0, 0])  # 1 = fish, 0 = dog
print('target:\n', y)
X = np.array([['red fish'], ['blue fish'], ['dog'], ['dog'], ['dog']])
print('data:\n', X)
print('Original dataset shape {}'.format(Counter(y)))  # 2 fish, 3 dogs
print(type(X)); print(X)
print(y)

# Older imbalanced-learn releases spelled these `ratio=` and `fit_sample`
ros = RandomOverSampler(sampling_strategy='auto', random_state=42)
X_res, y_res = ros.fit_resample(X, y)
print('Resampled dataset shape {}'.format(Counter(y_res)))  # 3 fish, 3 dogs
print(type(X_res)); print(X_res); print(y_res)