Stratified samples from Pandas

Tags: python, pandas
I have a pandas DataFrame which looks approximately as follows:

cli_id | X1 | X2 | X3 | ... | Xn |  Y  |
----------------------------------------
123    | 1  | A  | XX | ... | 4  | 0.1 |
456    | 2  | B  | XY | ... | 5  | 0.2 |
789    | 1  | B  | XY | ... | 5  | 0.3 |
101    | 2  | A  | XX | ... | 4  | 0.1 |
...

I have a client id, a few categorical attributes, and Y, the probability of an event, which takes values from 0 to 1 in steps of 0.1.

I need to take a stratified sample of size 200 from every group of Y (so 10 folds).

I often use this to take a stratified sample when splitting into train/test:

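from sklearn.model_selection import StratifiedShuffleSplit

def stratifiedSplit(X, y, size):
    # scikit-learn >= 0.18: the splitter takes n_splits, and the data goes to split()
    sss = StratifiedShuffleSplit(n_splits=1, test_size=size, random_state=0)

    for train_index, test_index in sss.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    return X_train, X_test, y_train, y_test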

But I don't know how to modify it in this case.


asked Dec 08 '16 08:12

HonzaB

2 Answers

If the number of samples is the same for every group, or if the proportion is constant for every group, you could try one of the following (a fixed count per group, or a fixed fraction per group):

df.groupby('Y').apply(lambda x: x.sample(n=200))

df.groupby('Y').apply(lambda x: x.sample(frac=.1))

To perform stratified sampling with respect to more than one variable, just group with respect to more variables. It may be necessary to construct new binned variables to this end.
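As a rough sketch of grouping on more than one variable, the snippet below bins X1 with pd.cut and samples 10% from every (Y, X1_bin) combination. The synthetic data, the X1_bin column name and the bin edges are assumptions made for illustration. It also uses GroupBy.sample (available in pandas 1.1 and later) instead of the apply/lambda form shown above, which should behave the same way here.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the question's DataFrame (only X1 and Y are modelled).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "X1": rng.integers(1, 6, size=5000),        # integers 1..5
    "Y": rng.integers(0, 11, size=5000) / 10,   # 0.0, 0.1, ..., 1.0
})

# Hypothetical binned variable: (0, 2] -> "low", (2, 5] -> "high".
df["X1_bin"] = pd.cut(df["X1"], bins=[0, 2, 5], labels=["low", "high"])

# Draw 10% of the rows from every (Y, X1_bin) combination.
sample = df.groupby(["Y", "X1_bin"], observed=True).sample(frac=0.1, random_state=0)
```

Because every stratum contributes the same fraction, the sample should keep roughly the same mix of Y and X1_bin as the full data.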

However, if a group is too small relative to the proportion, for example a group of size 1 with a proportion of .25, no item will be returned for that group. This is because the fractional size is converted to a whole number of rows, so 0.25 becomes 0 (compare int(0.25) = 0).
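The small-group case can be reproduced on toy data. The sketch below also shows one possible workaround, a hypothetical helper named sample_at_least_one that never lets the per-group size fall below one row. The helper and the toy columns are assumptions for illustration, not part of pandas.

```python
import pandas as pd

# Toy data: group "a" has a single row, group "b" has eight rows.
df = pd.DataFrame({"Y": ["a"] + ["b"] * 8, "value": range(9)})

# 25% of one row becomes zero rows, so group "a" drops out of the sample.
plain = df.groupby("Y").sample(frac=0.25, random_state=0)

def sample_at_least_one(group, frac=0.25):
    # Hypothetical helper: same rounding as before, but never fewer than one row.
    n = max(1, round(frac * len(group)))
    return group.sample(n=n, random_state=0)

# Apply the helper group by group and stitch the pieces back together.
guarded = pd.concat(sample_at_least_one(g) for _, g in df.groupby("Y"))
```

Note that forcing at least one row per group over-represents tiny groups slightly, so whether this is acceptable depends on what the sample is used for.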


answered Oct 12 '22 12:10

Quickbeam2k1

I'm not totally sure whether you mean this:

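strats = []
for k in range(11):
    y_val = k*0.1
    # compare with a tolerance (needs numpy as np): k*0.1 is not always exactly the stored value
    dummy_df = your_df[np.isclose(your_df['Y'], y_val)]
    strats.append( dummy_df.sample(200) )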

That makes a dummy dataframe consisting of only the rows with the Y value you want, and then takes a sample of 200 from it.
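One thing to watch when selecting rows by a Y value computed as k*0.1: floating-point arithmetic means the product is not always exactly equal to the value stored in the column. A small illustration on made-up data, comparing an exact match with a tolerant one via np.isclose:

```python
import numpy as np
import pandas as pd

# Made-up data with two rows at Y = 0.3.
your_df = pd.DataFrame({"Y": [0.1, 0.2, 0.3, 0.3, 0.7]})

y_val = 3 * 0.1  # evaluates to 0.30000000000000004, not 0.3

# Exact comparison misses both rows stored as 0.3.
exact = your_df[your_df["Y"] == y_val]

# Comparison with a tolerance finds them.
tolerant = your_df[np.isclose(your_df["Y"], y_val)]
```

Here exact is empty while tolerant holds the two rows, so an exact comparison would silently produce an empty group for this value of k.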

OK, so you need the different chunks to have the same structure. I guess that's a bit harder; here's how I would do it:

First of all, I would get a histogram of what X1 looks like:

hist, edges = np.histogram(your_df['X1'], bins=np.linspace(min_x, max_x, nbins + 1))

We now have a histogram with nbins bins (nbins + 1 edges), where min_x and max_x are the smallest and largest values of X1.

Now the strategy is to draw a certain number of rows depending on their value of X1. We will draw more from the bins with more observations and fewer from the bins with fewer, so that the structure of X1 is preserved.

In particular, the relative contribution of every bin should be:

rel = [float(i) / sum(hist) for i in hist]

This will be something like [0.1, 0.2, 0.1, 0.3, 0.3], one fraction per bin, summing to 1.

If we want 200 samples, we need to draw:

draws_in_bin = [int(i*200) for i in rel]
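Because int() truncates, the per-bin counts computed this way can add up to slightly fewer than 200. If the exact total matters, one option is a largest-remainder allocation. The helper below, allocate_draws, is a hypothetical name and only a sketch of that idea, not something the approach above requires.

```python
import numpy as np

def allocate_draws(hist, total=200):
    # Hypothetical helper: split `total` draws across bins in proportion to `hist`.
    hist = np.asarray(hist, dtype=float)
    exact = hist / hist.sum() * total      # ideal (fractional) draws per bin
    draws = np.floor(exact).astype(int)    # round every bin down first
    shortfall = total - draws.sum()        # draws still left to hand out
    # Give the leftover draws to the bins with the largest fractional parts.
    order = np.argsort(exact - draws)[::-1]
    draws[order[:shortfall]] += 1
    return draws

hist = [3, 3, 3]
truncated = [int(h / sum(hist) * 200) for h in hist]  # the int() approach
allocated = allocate_draws(hist)                      # largest-remainder approach
```

With three equally filled bins, truncation gives 66 draws per bin (198 in total), while the allocation hands the two leftover draws to two of the bins so the counts sum to 200.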

Now we know how many observations to draw from every bin:

strats = []
last = len(draws_in_bin) - 1
for k in range(11):
    # get a dataframe for every value of Y (tolerant float comparison)
    dummy_df = your_df[np.isclose(your_df['Y'], k*0.1)]

    bin_strat = []
    for i, (left_edge, right_edge, n_draws) in enumerate(
            zip(edges[:-1], edges[1:], draws_in_bin)):

        # np.histogram bins are half-open [left, right), except the last one,
        # which also includes its right edge
        if i == last:
            upper = dummy_df['X1'] <= right_edge
        else:
            upper = dummy_df['X1'] < right_edge
        bin_df = dummy_df[(dummy_df['X1'] >= left_edge) & upper]

        # take the number of draws assigned to this X1 bin, capped at the rows
        # available so that sample() does not raise on a sparse bin
        bin_strat.append(bin_df.sample(min(n_draws, len(bin_df))))
        # every element of bin_strat is a dataframe whose size follows
        # the structure of draws_in_bin

    # concatenate the dataframes for every bin and append to the list
    strats.append( pd.concat(bin_strat) )

answered Oct 12 '22 14:10

elelias
