
Return the index using pandas Series.sample()?

I have a pandas Series whose values are usernames, with each user appearing on several rows. I want to draw a random sample of rows for each user and return the index labels of the sampled rows.

The series looks something like this (each user appears on multiple rows):

index    
row1    user1
row2    user2
row3    user2
row4    user1
row5    user2
row6    user1
row7    user3
...

The function I wrote looks like this:

def get_random_sample(series, sample_size, users):
    """ Grab a random sample of size sample_size of the tickets resolved by each user in the list users.
        Series has the ticket number as index, and the username as the series values.
        Returns a dict {user:[sample_tickets]}
    """
    sample_dict = {}
    for user in users:
        sample_dict[user] = series[series==user].sample(n=sample_size, replace=False)

    return sample_dict

What's being returned is the following:

# assuming sample_size is 4
{user1: [user1, user1, user1, user1],
 user2: [user2, user2, user2, user2],
...}

But what I want to get for my output is:

{user1: [row1, row6, row32, row40],
 user2: [row3, row5, row17, row39],
...}
# where row# is the index label for the corresponding row.

Basically, I want pandas Series.sample() to return the index labels of the sampled items instead of their values. I'm not sure whether this is possible or whether I'd be better off restructuring my data first (maybe have the users as column names in a DataFrame, with the index labels as the values under each column? I'm not sure how to do this, though). Any insight is appreciated.

andraiamatrix asked Aug 30 '17 19:08



1 Answer

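As @user48956 commented on the accepted answer, it is much faster to sample over the index using numpy.random.choice. Note that numpy.random.choice samples with replacement by default, so pass replace=False if the sampled labels must be unique, as they are with df.sample.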

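import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 100, size=(10000000, 4)), columns=list('ABCD'))
# %time is an IPython magic, so run these lines in IPython or Jupyter
%time df.sample(100000).index
print(_)
%time pd.Index(np.random.choice(df.index, 100000))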
Wall time: 710 ms
Int64Index([7141956, 9256789, 1919656, 2407372, 9181191, 2474961, 2345700,
            4394530, 8864037, 6096638,
            ...
             471501, 3616956, 9397742, 6896140,  670892, 9546169, 4146996,
            3465455, 7748682, 5271367],
           dtype='int64', length=100000)
Wall time: 6.05 ms

Int64Index([7141956, 9256789, 1919656, 2407372, 9181191, 2474961, 2345700,
            4394530, 8864037, 6096638,
            ...
             471501, 3616956, 9397742, 6896140,  670892, 9546169, 4146996,
            3465455, 7748682, 5271367],
           dtype='int64', length=100000)
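
Applied to the series from the question, either approach could look like the sketch below. This is only an illustration, not a definitive implementation: the function names, the seed parameter, and the small tickets series are made up for the example. The first function keeps Series.sample() and reads the labels from .index; the second samples the index labels directly with NumPy, passing replace=False so no label is drawn twice.

```python
import numpy as np
import pandas as pd


def get_random_sample_index(series, sample_size, users, seed=None):
    """Return {user: [index labels]} using Series.sample() and its .index."""
    sample_dict = {}
    for user in users:
        sampled = series[series == user].sample(
            n=sample_size, replace=False, random_state=seed
        )
        # .index holds the row labels of the sampled rows, not the usernames
        sample_dict[user] = sampled.index.tolist()
    return sample_dict


def get_random_sample_index_np(series, sample_size, users, seed=None):
    """Same result shape, but samples the index labels directly with NumPy."""
    rng = np.random.default_rng(seed)
    sample_dict = {}
    for user in users:
        labels = series[series == user].index.to_numpy()
        # replace=False: the default for choice() is sampling WITH replacement
        sample_dict[user] = rng.choice(
            labels, size=sample_size, replace=False
        ).tolist()
    return sample_dict


# Hypothetical data mirroring the layout shown in the question
tickets = pd.Series(
    ["user1", "user2", "user2", "user1", "user2", "user1", "user3"],
    index=["row1", "row2", "row3", "row4", "row5", "row6", "row7"],
)

result = get_random_sample_index(tickets, 2, ["user1", "user2"], seed=0)
result_np = get_random_sample_index_np(tickets, 2, ["user1", "user2"], seed=0)
print(result)
print(result_np)
```

Both versions raise a ValueError if a user has fewer rows than sample_size, because sampling without replacement cannot return more items than exist.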
Ameb answered Oct 04 '22 02:10
