Block Bootstrapped Sampling in Pandas

Tags: python, pandas

I'm trying to implement Block Bootstrapping in Pandas.

For example, suppose my DataFrame looks something like:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'personid': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'month': ['Jan', 'Feb', 'Mar', 'Aug', 'Sep', 'Mar', 'Apr', 'May', 'Jun'],
    'values': [100, 200, 300, 400, 500, 600, 700, 800, 900],
})

df
     month  personid values
0    Jan    1        100
1    Feb    1        200
2    Mar    1        300
3    Aug    2        400
4    Sep    2        500
5    Mar    3        600
6    Apr    3        700
7    May    3        800
8    Jun    3        900

In particular, the DataFrame is unique at the (month, personid) level, and in reality it contains many rows, with each personid associated with a different number of months.

I want to implement a "block bootstrap" at the personid level. That is, I want to sample with replacement from the set of unique personid values and then build a DataFrame that contains, for every sampled personid, all of its associated month and values rows (repeated once per draw).

So for example, I have something like this:

personids = df.personid.unique()

which in this case would result in

 array([1, 2, 3])

Then, I would sample with replacement:

np.random.choice(personids, size=personids.size, replace=True)

In this case, this might result in:

array([3, 3, 2])
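
For draws that can be reproduced from run to run, the same sampling step could be written with NumPy's Generator API. This is only a minimal sketch: the seed value 0 and the hard-coded personids array are assumptions made for illustration.

```python
import numpy as np

# Stand-in for df.personid.unique() from the example above.
personids = np.array([1, 2, 3])

# A seeded generator draws the same bootstrap sample on every run.
# The seed value 0 is arbitrary and only here for illustration.
rng = np.random.default_rng(0)

# Sample as many ids as there are unique ids, with replacement.
sampled_personids = rng.choice(personids, size=personids.size, replace=True)
```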

Given that sample, I would want a bootstrapped DataFrame, call it bootstrapped_df, equal to:

     month  personid values
0    Mar    3        600
1    Apr    3        700
2    May    3        800
3    Jun    3        900
4    Mar    3        600
5    Apr    3        700
6    May    3        800
7    Jun    3        900
8    Aug    2        400
9    Sep    2        500

The way I've done it so far is this:

def create_bootstrapped_df(df, sampled_personids):
    """
    Create "Block" Bootstrapped DataFrame given a vector of sampled_personids

    Keyword Args:
        df: DataFrame containing cost data at the personid, month level
        sampled_personids: A vector of personids that is already sampled with replacement.
    """
    bootstrapped = []
    for person in sampled_personids:
        person_df = df.loc[df.personid == person]
        bootstrapped.append(person_df)
    bootstrapped_sample = pd.concat(bootstrapped)
    bootstrapped_sample.reset_index(drop=True, inplace=True)
    return bootstrapped_sample

Basically, the function loops through the sampled personid vector, subsets the original DataFrame for each sampled personid, and then concatenates the pieces. I'm afraid this is very inefficient, since every iteration scans the whole DataFrame. Is there a better way to do this?

Vincent asked Aug 01 '18 19:08


2 Answers

Actually, I just figured out a very easy way to do this. If I set personid as the index, then I can subset the DataFrame by the index and it will do the thing I want.

For example, if I do:

sampled_personids = np.random.choice(personids, size=personids.size, replace=True)

that gives me

array([1, 2, 2])

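And then, after setting personid as the index (keeping it as a column as well), if I do:

df = df.set_index('personid', drop=False)
df.loc[sampled_personids]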

I get:

          month personid values
personid
1         Jan   1        100
1         Feb   1        200
1         Mar   1        300
2         Aug   2        400
2         Sep   2        500
2         Aug   2        400
2         Sep   2        500
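
Putting the index-based idea together, a self-contained sketch might look like the following. The function name block_bootstrap_by_index, its key and rng parameters, and the seed are hypothetical choices for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'personid': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'month': ['Jan', 'Feb', 'Mar', 'Aug', 'Sep', 'Mar', 'Apr', 'May', 'Jun'],
    'values': [100, 200, 300, 400, 500, 600, 700, 800, 900],
})

def block_bootstrap_by_index(df, key, rng):
    """Resample whole blocks of rows that share the same value of `key`.

    Hypothetical helper: `key` is the block column, `rng` a NumPy Generator.
    """
    # Keep `key` as a regular column too, so the result has the original columns.
    indexed = df.set_index(key, drop=False)
    ids = df[key].unique()
    sampled = rng.choice(ids, size=ids.size, replace=True)
    # .loc with a list of labels returns all rows for each label, in the order
    # given, so an id that is drawn twice contributes its block twice.
    return indexed.loc[sampled].reset_index(drop=True)

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
bootstrapped_df = block_bootstrap_by_index(df, 'personid', rng)
```

Setting the index once avoids rebuilding a boolean mask over the full DataFrame for every sampled id, which is what the loop in the question does.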
Vincent answered Sep 24 '22 16:09


You can use merge. First, create a bootstrapped_df containing only the randomly sampled personids:

bootstrapped_df = pd.DataFrame({'personid':np.random.choice( personids, size=personids.size, 
                                                             replace=True)})

For me, it was:

   personid
0         2
1         1
2         1

Then use merge with the parameter how='left', which keeps the order of the sampled personids and repeats each person's rows once per draw:

bootstrapped_df = bootstrapped_df.merge(df,how='left')

and I get the following bootstrapped_df:

   personid month  values
0         2   Aug     400
1         2   Sep     500
2         1   Jan     100
3         1   Feb     200
4         1   Mar     300
5         1   Jan     100
6         1   Feb     200
7         1   Mar     300

EDIT: You can do everything in a single statement:

bootstrapped_df = (pd.DataFrame({'personid':np.random.choice( personids, size=personids.size, 
                                                             replace=True)})
                     .merge(df,how='left'))
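
The merge approach can also be wrapped in a small function and called repeatedly to build a bootstrap distribution. The sketch below is one possible way to do that: the function name block_bootstrap_by_merge, the explicit on=key argument, the seed, and the choice of 100 replicates are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'personid': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'month': ['Jan', 'Feb', 'Mar', 'Aug', 'Sep', 'Mar', 'Apr', 'May', 'Jun'],
    'values': [100, 200, 300, 400, 500, 600, 700, 800, 900],
})

def block_bootstrap_by_merge(df, key, rng):
    """Resample whole blocks of rows sharing a `key` value via a left merge.

    Hypothetical helper: `key` is the block column, `rng` a NumPy Generator.
    """
    ids = df[key].unique()
    sampled = pd.DataFrame({key: rng.choice(ids, size=ids.size, replace=True)})
    # A left merge keeps the sampled order; each sampled id expands into
    # all of its matching rows, so repeated ids yield repeated blocks.
    return sampled.merge(df, on=key, how='left')

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
bootstrapped_df = block_bootstrap_by_merge(df, 'personid', rng)

# Example use: bootstrap distribution of the mean of 'values'
# (100 replicates is an arbitrary number chosen for illustration).
means = [
    block_bootstrap_by_merge(df, 'personid', rng)['values'].mean()
    for _ in range(100)
]
```

Passing on=key makes the join column explicit instead of relying on merge to infer the shared column name.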
Ben.T answered Sep 22 '22 16:09