I have a pandas DataFrame which looks approximately as follows:
cli_id | X1 | X2 | X3 | ... | Xn | Y   |
-------|----|----|----|-----|----|-----|
123    | 1  | A  | XX | ... | 4  | 0.1 |
456    | 2  | B  | XY | ... | 5  | 0.2 |
789    | 1  | B  | XY | ... | 5  | 0.3 |
101    | 2  | A  | XX | ... | 4  | 0.1 |
...
I have a client id, a few categorical attributes, and Y, which is the probability of an event; it takes values from 0 to 1 in steps of 0.1.
I need to take a stratified sample of size 200 from every group of Y (so 10 folds).
I often use this to take a stratified sample when splitting into train/test:
from sklearn.cross_validation import StratifiedShuffleSplit  # pre-0.18 sklearn API

def stratifiedSplit(X, y, size):
    sss = StratifiedShuffleSplit(y, n_iter=1, test_size=size, random_state=0)
    for train_index, test_index in sss:
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    return X_train, X_test, y_train, y_test
But I don't know how to modify it in this case.
For example, if a researcher wanted a sample of 50,000 graduates stratified by age range, the proportionate stratified random sample for each stratum is obtained with this formula: (sample size / population size) × stratum size.
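To make the arithmetic concrete (the population and stratum sizes below are made up for illustration):
sample_size = 50_000        # desired overall sample
population_size = 500_000   # illustrative population of graduates
stratum_size = 8_000        # e.g. one age-range stratum
draws = sample_size / population_size * stratum_size   # 0.1 * 8000 = 800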
Scikit-learn provides two classes for stratified splitting: StratifiedKFold, which builds n_splits folds such that each fold preserves the class proportions of the full dataset, and StratifiedShuffleSplit, which draws randomized stratified train/test splits. Stratification can also be requested when splitting with train_test_split by passing the stratify argument.
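As a minimal sketch of both routes, assuming the modern sklearn.model_selection API and the df from the question:
from sklearn.model_selection import StratifiedKFold, train_test_split

X = df.drop(columns=['cli_id', 'Y'])   # features
y = df['Y']                            # the stratum label

# Route 1: stratified K folds -- each fold preserves the distribution of y
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]

# Route 2: a single stratified split via the stratify flag
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)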
If the number of samples is the same for every group, or if the proportion is constant for every group, you could try something like
df.groupby('Y').apply(lambda x: x.sample(n=200))
or
df.groupby('Y').apply(lambda x: x.sample(frac=.1))
To perform stratified sampling with respect to more than one variable, just group with respect to more variables. It may be necessary to construct new binned variables to this end.
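For example, a sketch of stratifying on Y together with a binned version of a numeric column (the column names follow the question; the bin count is illustrative):
import pandas as pd

df['X1_bin'] = pd.cut(df['X1'], bins=4)   # construct a binned variable
sample = (df.groupby(['Y', 'X1_bin'], observed=True)
            .apply(lambda g: g.sample(frac=.1)))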
However, if a group is too small relative to the fraction (e.g. group size 1 and frac=.25), no item will be returned from that group, because the fractional count of rows to draw (1 * 0.25 = 0.25) is rounded down to 0.
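You can see this with a one-row group:
import pandas as pd

tiny = pd.DataFrame({'Y': [0.1], 'X1': [7]})
tiny.groupby('Y').apply(lambda g: g.sample(frac=.25))   # empty: 1 * 0.25 rounds to 0 rows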
I'm not totally sure whether you mean this:
strats = []
for k in range(11):
    y_val = round(k * 0.1, 1)   # round to avoid float drift (3 * 0.1 != 0.3)
    # keep only the rows with this value of Y, then draw 200 of them
    dummy_df = your_df[your_df['Y'] == y_val]
    strats.append(dummy_df.sample(200))
That makes a dummy DataFrame consisting of only the rows with the Y value you want, and then takes a sample of 200 from it. (If a group has fewer than 200 rows, sample raises an error unless you pass replace=True.)
OK, so you need the different chunks to have the same structure. That's a bit harder, I guess; here's how I would do it:
First of all, I would get a histogram of what X1 looks like:
import numpy as np

# min_x, max_x and nbins are your choice; nbins + 1 edges give nbins bins
hist, edges = np.histogram(your_df['X1'], bins=np.linspace(min_x, max_x, nbins + 1))
We now have a histogram with nbins bins.
Now the strategy is to draw a certain number of rows from each bin depending on their value of X1. We will draw more from the bins with more observations and fewer from the bins with fewer, so that the structure of X1 is preserved.
In particular, the relative contribution of every bin should be:
rel = [float(i) / sum(hist) for i in hist]
This will be something like [0.1, 0.2, 0.1, 0.3, 0.3]
If we want 200 samples, we need to draw:
draws_in_bin = [int(i*200) for i in rel]
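Note that int() truncates, so sum(draws_in_bin) can come out slightly below 200; a simple way to top it back up is to add the shortfall to the largest bin:
shortfall = 200 - sum(draws_in_bin)                      # lost to truncation
draws_in_bin[draws_in_bin.index(max(draws_in_bin))] += shortfall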
Now we know how many observations to draw from every bin:
import pandas as pd

strats = []
for k in range(11):
    y_val = round(k * 0.1, 1)   # round to avoid float drift (3 * 0.1 != 0.3)
    # get a dataframe for every value of Y
    dummy_df = your_df[your_df['Y'] == y_val]
    bin_strat = []
    for left_edge, right_edge, n_draws in zip(edges[:-1], edges[1:], draws_in_bin):
        bin_df = dummy_df[(dummy_df['X1'] >= left_edge)
                          & (dummy_df['X1'] < right_edge)]
        # this takes the right number of draws out
        # of the X1 bin where we currently are;
        # min() guards against bins with fewer rows than n_draws
        bin_strat.append(bin_df.sample(min(n_draws, len(bin_df))))
    # Note that every element of bin_strat is a dataframe
    # with a number of entries that corresponds to the
    # structure of draws_in_bin
    #
    # concatenate the dataframes for every bin and append to the list
    strats.append(pd.concat(bin_strat))
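If you want one flat DataFrame at the end rather than a list, concatenate the per-Y strata:
final_sample = pd.concat(strats, ignore_index=True)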