
Python pandas: conditionally select a uniform sample from a dataframe

Say I have a dataframe like this:

category1  category2   other_col   another_col ....
a          1
a          2
a          2        
a          3
a          3
a          1
b          10
b          10
b          10
b          11
b          11
b          11

I want to draw a sample from my dataframe in which each value of category1 appears a uniform number of times. I'm assuming that each value of category1 occurs equally often in the dataframe. I know that this can be done with pandas using DataFrame.sample(). However, I also want the sample to represent the values of category2 equally. So, for example, with a sample size of 5, I would want something such as:

a  1
a  2
b  10
b  11
b  10

I would not want something such as:

a 1
a 1
b 10
b 10
b 10

While this is a valid random sample of n=5, it would not meet my requirements, as I want the values of category2 to vary as much as possible.

Notice that in the first example, because a was only sampled twice, the value 3 from category2 was not represented. This is okay. The goal is just to represent the data in the sample as uniformly as possible.

If it helps to provide a clearer example, one could think of having the categories fruit, vegetables, meat, grains, junk. In a sample of size 10, I would want to represent each category as evenly as possible, so ideally 2 of each. The 2 rows selected for each category would then have subcategories that are also represented as uniformly as possible. So, for example, fruit could have the subcategories red_fruits, yellow_fruits, etc. For the 2 fruit rows selected out of the 10, red_fruits and yellow_fruits would both be represented in the sample. Of course, with a larger sample size, we would include more of the subcategories of fruit (green_fruits, blue_fruits, etc.).
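To make the requirement concrete, here is a rough sketch of the kind of selection described above: shuffle the rows, then take them in an order that cycles through category1 and, within each category1 value, through category2. The function name balanced_sample and its parameters are made up for illustration, and the sketch assumes the dataframe has a unique index.

```python
import pandas as pd

def balanced_sample(df, n, outer, inner, random_state=None):
    """Pick n rows, cycling through `outer` groups and, inside each, `inner` groups.

    Hypothetical helper; assumes df has a unique index.
    """
    # Shuffle first so that the choice within each subgroup is random.
    shuffled = df.sample(frac=1, random_state=random_state)
    # Rank of each row within its (outer, inner) subgroup: 0, 1, 2, ...
    inner_rank = shuffled.groupby([outer, inner]).cumcount()
    # Rank-0 rows first, so each outer group starts with distinct inner values.
    by_inner = shuffled.loc[inner_rank.sort_values(kind="mergesort").index]
    # Rank of each row within its outer group under that ordering.
    outer_rank = by_inner.groupby(outer).cumcount()
    # Taking the lowest outer ranks first cycles through the outer groups.
    picked = outer_rank.sort_values(kind="mergesort").index[:n]
    return df.loc[picked]

df = pd.DataFrame({
    "category1": list("aaaaaabbbbbb"),
    "category2": [1, 2, 2, 3, 3, 1, 10, 10, 10, 11, 11, 11],
})

sample = balanced_sample(df, 5, "category1", "category2", random_state=0)
print(sample)
```

With the example data and n=5, no (category1, category2) pair should appear twice, and the counts of a and b differ by at most one.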

TheRealFakeNews asked Sep 12 '16 19:09


2 Answers

This is straightforward when you use the weights keyword in df.sample:

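>>> inv_freq = (df['category2'].value_counts() / len(df['category2'])) ** -1
>>> df.sample(n=5, weights=df['category2'].map(inv_freq))  # map per-value weights onto rows, since sample() aligns a weights Series on the row index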

output:

    category1   category2
2   "a"         2
1   "a"         2
10  "b"         11
3   "a"         3
11  "b"         11

To explain, the weight for each value of category2 looks like this:

11    4.0
10    4.0
3     6.0
2     6.0
1     6.0

I just took the relative frequency of each value in df['category2'] and inverted it, so rows with rarer values get proportionally larger weights and every value of category2 has roughly the same chance of being drawn.
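One detail worth checking is how the weights line up with the rows. DataFrame.sample aligns a weights Series with the dataframe on the index and assigns a weight of zero to rows whose index label is missing from the weights. A Series indexed by the category2 values therefore only matches rows whose index label happens to equal one of those values. The sketch below is a minimal illustration on the question's data (the variable names are my own): it maps the inverse frequencies onto the rows first.

```python
import pandas as pd

df = pd.DataFrame({
    "category1": list("aaaaaabbbbbb"),
    "category2": [1, 2, 2, 3, 3, 1, 10, 10, 10, 11, 11, 11],
})

# Relative frequency of each category2 value, indexed by the value itself.
freq = df["category2"].value_counts(normalize=True)

# Aligning the per-value weights directly on the row index only matches the
# rows whose index label coincides with a category2 value.
matched_rows = (1.0 / freq).reindex(df.index).notna().sum()

# Map each row to the inverse frequency of its own category2 value, so the
# weights share the dataframe's index.
row_weights = 1.0 / df["category2"].map(freq)

sample = df.sample(n=5, weights=row_weights, random_state=0)
print(matched_rows)
print(sample)
```

With row-aligned weights, the total weight of each category2 value is the same, which is what the inverse-frequency idea is after. Note that this balances category2 only; it does not by itself guarantee that category1 is balanced as well.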

Brian answered Oct 22 '22 15:10


Here is a solution that does a true random sample stratified by group (it won't give you exactly equal group sizes on every draw, but it does on average, which is probably better from a statistical perspective anyway):

import numpy as np
import pandas as pd

def stratified_sample(df, sample_size_per_class, strat_cols):

    if isinstance(strat_cols, str):
        strat_cols = [strat_cols]

    #a row is kept only if it draws chunk 0 in every stratification column
    keep = pd.Series(True, index=df.index)
    for c in strat_cols:

        #number of chunks of roughly sample_size_per_class rows in each class
        #(at least 1, so classes much smaller than the sample size don't break randint)
        _vc = df[c].value_counts()
        n_chunks = (_vc / sample_size_per_class).round(0).astype(int).clip(lower=1).to_dict()

        #randomly assign each row to one of the chunks of its class
        chunk = df[c].apply(lambda v: np.random.randint(0, n_chunks[v]))
        keep &= (chunk == 0)

    #return the rows that landed in chunk 0, which should be approximately
    #sample_size_per_class per class; the input dataframe is left unmodified
    return df[keep]

To test it:

test = pd.DataFrame({'category1':[0,0,0,0,0,0,1,1,1,1,1,1],
                    'category2':[1,2,2,3,3,1,10,10,10,11,11,11]})

lens = []
for i in range(1000):
    lens.append(
        len(
            stratified_sample(test, 3, ['category1','category2'])
        )
    )

print(np.mean(lens))
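For comparison, if you need an exact number of rows per stratum on every draw rather than on average, a different technique is to sample within each group. The sketch below is not the function above; it assumes pandas 1.1 or later, where DataFrameGroupBy.sample is available, and reuses the test frame from this answer.

```python
import pandas as pd

test = pd.DataFrame({
    "category1": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "category2": [1, 2, 2, 3, 3, 1, 10, 10, 10, 11, 11, 11],
})

# Draw exactly one row from every (category1, category2) combination.
# DataFrameGroupBy.sample requires pandas >= 1.1.
exact = test.groupby(["category1", "category2"]).sample(n=1, random_state=0)

print(exact)
print(exact.groupby("category1").size())
```

The trade-off is that the total sample size is fixed by the number of strata, so category1 ends up with as many rows as it has distinct category2 values.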
Andrew Manion answered Oct 22 '22 17:10