I have a pandas Series whose values are usernames drawn from a small set of users. I want to grab a random sample of rows for each user and return the index labels of the sampled rows.
The series looks something like this (each user appears on multiple rows):
index
row1 user1
row2 user2
row3 user2
row4 user1
row5 user2
row6 user1
row7 user3
...
The function I wrote looks like this:
def get_random_sample(series, sample_size, users):
    """ Grab a random sample of size sample_size of the tickets resolved by each user in the list users.
    Series has the ticket number as index, and the username as the series values.
    Returns a dict {user:[sample_tickets]}
    """
    sample_dict = {}
    for user in users:
        sample_dict[user] = series[series==user].sample(n=sample_size, replace=False)
    return sample_dict
What's being returned is the following:
# assuming sample_size is 4
{user1: [user1, user1, user1, user1],
user2: [user2, user2, user2, user2],
...}
But what I want to get for my output is:
{user1: [row1, row6, row32, row40],
user2: [row3, row5, row17, row39],
...}
# where row# is the index label for the corresponding row.
Basically, I want pandas Series.sample() to return the index labels of the randomly sampled items instead of their values. I'm not sure whether this is possible or whether I'm better off restructuring my data first (maybe a DataFrame with one column per user, where the index labels become the values in each column? I'm not sure how to do that, though). Any insight is appreciated.
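For illustration, here is a minimal sketch of one way to get the labels: Series.sample() returns a Series, so its .index attribute holds the row labels that were drawn. The function name get_random_sample_index and the random_state argument are my own additions for the example, and the small series just mirrors the layout shown in the question.

```python
import pandas as pd

def get_random_sample_index(series, sample_size, users, random_state=None):
    """Return {user: [index labels]} for a random sample of each user's rows."""
    sample_dict = {}
    for user in users:
        # Keep only the rows that belong to this user
        user_rows = series[series == user]
        # .sample() returns a Series; its .index holds the sampled row labels
        sampled = user_rows.sample(n=sample_size, replace=False,
                                   random_state=random_state)
        sample_dict[user] = sampled.index.tolist()
    return sample_dict

# Toy data in the same shape as the question (hypothetical values)
series = pd.Series(
    ["user1", "user2", "user2", "user1", "user2", "user1", "user3"],
    index=["row1", "row2", "row3", "row4", "row5", "row6", "row7"],
)
result = get_random_sample_index(series, 2, ["user1", "user2"], random_state=0)
print(result)
```

Note that sample() with replace=False raises a ValueError when a user has fewer than sample_size rows, so that case would need separate handling.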
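As @user48956 commented on the accepted answer, it is much faster to sample over the index using numpy.random.choice. Keep in mind that numpy.random.choice draws with replacement by default, so unlike df.sample() the result can contain repeated labels unless you pass replace=False.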
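import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 100, size=(10000000, 4)), columns=list('ABCD'))
# %time is an IPython/Jupyter magic, so run this in a notebook or IPython shell
%time df.sample(100000).index
print(_)
%time pd.Index(np.random.choice(df.index, 100000))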
Wall time: 710 ms
Int64Index([7141956, 9256789, 1919656, 2407372, 9181191, 2474961, 2345700,
4394530, 8864037, 6096638,
...
471501, 3616956, 9397742, 6896140, 670892, 9546169, 4146996,
3465455, 7748682, 5271367],
dtype='int64', length=100000)
Wall time: 6.05 ms
Int64Index([7141956, 9256789, 1919656, 2407372, 9181191, 2474961, 2345700,
4394530, 8864037, 6096638,
...
471501, 3616956, 9397742, 6896140, 670892, 9546169, 4146996,
3465455, 7748682, 5271367],
dtype='int64', length=100000)
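The same idea might be applied to the per-user case from the question roughly as follows. This is only a sketch under my own assumptions: the function name sample_index_per_user and the seed argument are hypothetical, and it uses NumPy's Generator.choice with replace=False so that no label is drawn twice. I have not timed this variant, so the speed-up shown above may not carry over.

```python
import numpy as np
import pandas as pd

def sample_index_per_user(series, sample_size, users, seed=None):
    """Return {user: [index labels]}, sampling labels directly with NumPy."""
    rng = np.random.default_rng(seed)
    out = {}
    for user in users:
        # Boolean mask -> labels of the rows that belong to this user
        mask = (series == user).to_numpy()
        labels = series.index[mask].to_numpy()
        # replace=False guarantees distinct labels within each user's sample
        out[user] = rng.choice(labels, size=sample_size, replace=False).tolist()
    return out

# Toy data in the same shape as the question (hypothetical values)
series = pd.Series(
    ["user1", "user2", "user2", "user1", "user2", "user1", "user3"],
    index=["row1", "row2", "row3", "row4", "row5", "row6", "row7"],
)
picked = sample_index_per_user(series, 2, ["user1", "user2"], seed=42)
print(picked)
```

As with sample(), asking for more labels than a user has rows raises a ValueError when replace=False.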