I'm trying to implement Block Bootstrapping in Pandas.
For example, suppose my DataFrame looks something like:
df = pd.DataFrame({
    'personid': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'month': ['Jan', 'Feb', 'Mar', 'Aug', 'Sep', 'Mar', 'Apr', 'May', 'Jun'],
    'values': [100, 200, 300, 400, 500, 600, 700, 800, 900],
})
df
  month  personid  values
0   Jan         1     100
1   Feb         1     200
2   Mar         1     300
3   Aug         2     400
4   Sep         2     500
5   Mar         3     600
6   Apr         3     700
7   May         3     800
8   Jun         3     900
In particular, the DataFrame is unique at the (month, personid) level and in reality contains many rows, with each personid associated with a different number of months.
I want to implement a "block bootstrap" at the personid level. That is, I want to sample with replacement from the set of all unique values in personid and then return a DataFrame built from that sample, carrying all of the associated month and values columns with it.
So for example, I have something like this:
personids = df.personid.unique()
which in this case would result in
array([1, 2, 3])
Then, I would sample with replacement:
np.random.choice(personids, size=personids.size, replace=True)
In this case, this might result in:
array([3, 3, 2])
So now, if that were the resulting sample, I would want a bootstrapped DataFrame, call it bootstrapped_df, such that bootstrapped_df would equal:
  month  personid  values
0   Mar         3     600
1   Apr         3     700
2   May         3     800
3   Jun         3     900
4   Mar         3     600
5   Apr         3     700
6   May         3     800
7   Jun         3     900
8   Aug         2     400
9   Sep         2     500
The way I've done it so far is this:
def create_bootstrapped_df(df, sampled_personids):
    """
    Create "Block" Bootstrapped DataFrame given a vector of sampled_personids

    Keyword Args:
        df: DataFrame containing cost data at the personid, month level
        sampled_personids: A vector of personids that is already sampled with replacement.
    """
    bootstrapped = []
    for person in sampled_personids:
        # Pull out every row (the whole "block") for this person
        person_df = df.loc[df.personid == person]
        bootstrapped.append(person_df)
    # Stack the blocks in sampled order and renumber the rows
    bootstrapped_sample = pd.concat(bootstrapped)
    bootstrapped_sample.reset_index(drop=True, inplace=True)
    return bootstrapped_sample
Basically, the function loops through the sampled personid vector, subsets the original DataFrame to pull out the rows for each personid, and then concatenates everything together. I'm afraid this is very inefficient. Is there a better way to do this?
Actually, I just figured out a very easy way to do this. If I set personid as the index, then I can subset the DataFrame by the index and it will do what I want.
For example, if I do:
sampled_personids = np.random.choice(personids, size=personids.size, replace=True)
that gives me
array([1, 2, 2])
And then if I do:
df.loc[sampled_personids]
I get:
         month  personid  values
personid
1          Jan         1     100
1          Feb         1     200
1          Mar         1     300
2          Aug         2     400
2          Sep         2     500
2          Aug         2     400
2          Sep         2     500
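For completeness, here is a minimal, self-contained sketch of the index-based approach. It assumes the index was set with set_index('personid', drop=False), which is only inferred from the output above (personid appears both as the index and as a column); the sampled ids are hard-coded to array([1, 2, 2]) so the result is reproducible, and the final reset_index is an optional extra, not part of the original snippet.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'personid': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'month': ['Jan', 'Feb', 'Mar', 'Aug', 'Sep', 'Mar', 'Apr', 'May', 'Jun'],
    'values': [100, 200, 300, 400, 500, 600, 700, 800, 900],
})

# Assumption: keep personid as a column as well as the index (drop=False)
indexed_df = df.set_index('personid', drop=False)

# Fixed draw for illustration; normally this comes from np.random.choice
sampled_personids = np.array([1, 2, 2])

# Label-based selection on a non-unique index returns every row for each
# requested label, once per occurrence of that label in the sample
bootstrapped_df = indexed_df.loc[sampled_personids]

# Optional: renumber the rows 0..n-1 instead of keeping personid as the index
bootstrapped_df = bootstrapped_df.reset_index(drop=True)
print(bootstrapped_df)
```

Because person 2 was drawn twice, its two rows appear twice, giving 3 + 2 + 2 = 7 rows in total.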
You can use merge. First create a bootstrapped_df with just the randomly sampled personids:
bootstrapped_df = pd.DataFrame({
    'personid': np.random.choice(personids, size=personids.size, replace=True)
})
For me, it was:
   personid
0         2
1         1
2         1
Then use merge with the parameter how='left':
bootstrapped_df = bootstrapped_df.merge(df,how='left')
and for bootstrapped_df I get:
   personid month  values
0         2   Aug     400
1         2   Sep     500
2         1   Jan     100
3         1   Feb     200
4         1   Mar     300
5         1   Jan     100
6         1   Feb     200
7         1   Mar     300
EDIT: you can do everything in one statement:
bootstrapped_df = (
    pd.DataFrame({'personid': np.random.choice(personids, size=personids.size, replace=True)})
    .merge(df, how='left')
)
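As a rough end-to-end sketch of the merge approach, the snippet below rebuilds the example frame, draws the ids with a seeded generator so the run is repeatable, and checks that every sampled id brought its whole block along. The use of np.random.default_rng, the seed value 0, the explicit on='personid', and the block_sizes sanity check are illustrative choices that are not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'personid': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'month': ['Jan', 'Feb', 'Mar', 'Aug', 'Sep', 'Mar', 'Apr', 'May', 'Jun'],
    'values': [100, 200, 300, 400, 500, 600, 700, 800, 900],
})

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
personids = df['personid'].unique()
sampled = rng.choice(personids, size=personids.size, replace=True)

# One left-hand row per draw; the left merge expands each draw into the
# full block of rows for that personid, repeating blocks drawn more than once
bootstrapped_df = (
    pd.DataFrame({'personid': sampled})
    .merge(df, on='personid', how='left')
)

# Sanity check: the row count equals the sum of the sampled block sizes
block_sizes = df.groupby('personid').size()
expected_rows = int(block_sizes.loc[sampled].sum())
assert len(bootstrapped_df) == expected_rows
print(bootstrapped_df)
```

Naming the key with on='personid' avoids merging on any other column that the two frames might happen to share in a larger dataset.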