Stratified samples from Pandas

Tags:

python

pandas

I have a pandas DataFrame which looks approximately as follows:

cli_id | X1 | X2 | X3 | ... | Xn |  Y  |
----------------------------------------
123    | 1  | A  | XX | ... | 4  | 0.1 |
456    | 2  | B  | XY | ... | 5  | 0.2 |
789    | 1  | B  | XY | ... | 5  | 0.3 |
101    | 2  | A  | XX | ... | 4  | 0.1 |
...

I have a client id, a few categorical attributes, and Y, the probability of an event, which takes values from 0 to 1 in steps of 0.1.

I need to take a stratified sample of size 200 from every group of Y (so 10 folds).

I often use this to take a stratified sample when splitting into train/test:

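from sklearn.model_selection import StratifiedShuffleSplit

def stratifiedSplit(X, y, size):
    # one stratified shuffle split; size is the test fraction or an absolute count
    sss = StratifiedShuffleSplit(n_splits=1, test_size=size, random_state=0)

    for train_index, test_index in sss.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    return X_train, X_test, y_train, y_test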

But I don't know how to modify it in this case.

HonzaB asked Dec 08 '16 08:12

2 Answers

If the number of samples is the same for every group, or if the proportion is constant for every group, you could try something like

df.groupby('Y').apply(lambda x: x.sample(n=200))

or

df.groupby('Y').apply(lambda x: x.sample(frac=.1))

To perform stratified sampling with respect to more than one variable, just group with respect to more variables. It may be necessary to construct new binned variables to this end.
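
As a rough sketch of grouping on several variables, the example below bins a continuous column with pd.cut and then samples a fixed fraction from every combination of keys. The synthetic data, the "age" column, the bin edges and the 10% fraction are all assumptions made up for illustration; only Y and X1 come from the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (hypothetical values)
df = pd.DataFrame({
    "cli_id": np.arange(1000),
    "X1": rng.integers(1, 3, size=1000),        # categorical: 1 or 2
    "age": rng.uniform(18, 80, size=1000),      # hypothetical continuous column
    "Y": rng.integers(0, 3, size=1000) / 10,    # only three Y values, for brevity
})

# Bin the continuous column so that it can serve as a stratification key
df["age_bin"] = pd.cut(df["age"], bins=[18, 40, 60, 80], include_lowest=True)

# Draw 10% of the rows from every (Y, X1, age_bin) cell
keys = ["Y", "X1", "age_bin"]
sample = df.groupby(keys, observed=True).sample(frac=0.1, random_state=0)
```

This uses the groupby(...).sample(...) shortcut (available in pandas 1.1 and later) rather than apply with a lambda; both draw independently within each group.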

However, if the group size is too small relative to the proportion, e.g. a group of size 1 and a proportion of .25, then no item will be returned. This is because the computed sample size (0.25 here) is converted to an integer, which gives 0.
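
A small sketch of that edge case, with one possible workaround. The toy data and the "at least one row per group" rule are my own assumptions, not something the question asks for.

```python
import pandas as pd

# One group with a single row (Y == 0.1) and one group with eight rows (Y == 0.2)
df = pd.DataFrame({"cli_id": range(9), "Y": [0.1] + [0.2] * 8})
frac = 0.25

# Plain fractional sampling: the single-row group contributes no rows
plain = df.groupby("Y").sample(frac=frac, random_state=0)

# Possible workaround (an assumption): force at least one row per group
parts = []
for _, group in df.groupby("Y"):
    n = max(1, round(frac * len(group)))
    parts.append(group.sample(n=n, random_state=0))
at_least_one = pd.concat(parts)
```

Forcing a minimum of one row slightly over-represents tiny groups, so whether that is acceptable depends on what the sample is used for.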

Quickbeam2k1 answered Oct 12 '22 12:10

I'm not totally sure whether you mean this:

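import numpy as np

strats = []
for k in range(11):
    y_val = k * 0.1
    # compare with a tolerance: k * 0.1 is not always exactly equal to the stored float
    dummy_df = your_df[np.isclose(your_df['Y'], y_val)]
    strats.append(dummy_df.sample(200))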

That makes a dummy dataframe consisting of only the rows with the Y value you want, and then takes a sample of 200 from it.

OK, so you need the different chunks to have the same structure. That's a bit harder; here's how I would do it:

First of all, I would get a histogram of what X1 looks like:

hist, edges = np.histogram(your_df['X1'], bins=np.linspace(min_x, max_x, nbins + 1))

We now have a histogram with nbins bins (np.linspace needs nbins + 1 edges to delimit nbins bins).

Now the strategy is to draw a certain number of rows depending on their value of X1. We will draw more rows from the bins with more observations and fewer from the bins with fewer, so that the structure of X1 is preserved.

In particular, the relative contribution of every bin should be:

rel = [float(i) / sum(hist) for i in hist]

This will be something like [0.1, 0.2, 0.1, 0.3, 0.3]

If we want 200 samples, we need to draw:

draws_in_bin = [int(i*200) for i in rel]
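
Because int() truncates, these counts can add up to slightly less than 200. If the total has to be exact, one option is largest-remainder rounding, sketched below. The helper allocate_draws and the bin counts are hypothetical and only serve as an illustration.

```python
import numpy as np

def allocate_draws(hist, total):
    """Split `total` draws across bins in proportion to `hist`,
    using largest-remainder rounding so the counts sum to `total`."""
    hist = np.asarray(hist, dtype=float)
    exact = hist / hist.sum() * total          # ideal, fractional draw counts
    draws = np.floor(exact).astype(int)        # round everything down first
    shortfall = int(total - draws.sum())       # draws still to hand out
    # give the leftover draws to the bins with the largest fractional parts
    order = np.argsort(exact - draws)[::-1]
    draws[order[:shortfall]] += 1
    return draws

hist = [13, 27, 11, 31, 29]                    # hypothetical bin counts
truncated = [int(c / sum(hist) * 200) for c in hist]
balanced = allocate_draws(hist, 200)

print(sum(truncated))   # 197
print(balanced.sum())   # 200
```

With these counts, plain truncation loses three draws, while the balanced version hands them to the three bins that were rounded down the most.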

Now we know how many observations to draw from every bin:

strats = []
for k in range(11):
    y_val = k * 0.1

    # get a dataframe for every value of Y (tolerance avoids float equality issues)
    dummy_df = your_df[np.isclose(your_df['Y'], y_val)]

    bin_strat = []
    for i, n_draws in enumerate(draws_in_bin):
        left_edge, right_edge = edges[i], edges[i + 1]

        # bins are [left, right), except the last one, which also includes its right edge
        in_bin = (dummy_df['X1'] >= left_edge) & (dummy_df['X1'] < right_edge)
        if i == len(draws_in_bin) - 1:
            in_bin |= dummy_df['X1'] == right_edge
        bin_df = dummy_df[in_bin]

        # this takes the right number of draws out of the X1 bin where we
        # currently are, capped at the bin size so that sample() cannot fail
        bin_strat.append(bin_df.sample(min(n_draws, len(bin_df))))
        # Note that every element of bin_strat is a dataframe with a number
        # of entries that corresponds to the structure of draws_in_bin

    # concatenate the dataframes for every bin and append to the list
    strats.append(pd.concat(bin_strat))
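
For reference, here is a compact end-to-end sketch of the same idea on synthetic data, using pd.cut to label the bins instead of comparing against the edges by hand. The data, the column name x1_bin, and the values of nbins and n_total are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for your_df: 11 values of Y and a numeric X1
your_df = pd.DataFrame({
    "X1": rng.normal(size=22000),
    "Y": np.repeat(np.arange(11) / 10, 2000),
})

nbins, n_total = 5, 200
edges = np.linspace(your_df["X1"].min(), your_df["X1"].max(), nbins + 1)

# Label every row with the index of its X1 bin; include_lowest keeps the minimum
your_df["x1_bin"] = pd.cut(your_df["X1"], bins=edges,
                           include_lowest=True, labels=False)

# Share of each bin in the full data, turned into a number of draws
rel = your_df["x1_bin"].value_counts(normalize=True)
draws_in_bin = (rel * n_total).astype(int)

strats = []
for y_val, y_df in your_df.groupby("Y"):
    parts = []
    for b, bin_df in y_df.groupby("x1_bin"):
        # never ask for more rows than this bin holds for the current Y
        n = min(int(draws_in_bin.loc[b]), len(bin_df))
        parts.append(bin_df.sample(n, random_state=0))
    strats.append(pd.concat(parts))
```

Each element of strats is then a sample of at most n_total rows for one value of Y, with roughly the same X1 distribution as the full data.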
elelias answered Oct 12 '22 14:10
