Sampling one record per unique value (pandas, python)

I work with pandas DataFrames in Python and have a large DataFrame of users and their data. Each user can have multiple rows, and I want to sample one row per user. My current solution seems inefficient:

import pandas as pd

df1 = pd.DataFrame({'User': ['user1', 'user1', 'user2', 'user3', 'user2', 'user3'],
                    'B': ['B', 'B1', 'B2', 'B3', 'B4', 'B5'],
                    'C': ['C', 'C1', 'C2', 'C3', 'C4', 'C5'],
                    'D': ['D', 'D1', 'D2', 'D3', 'D4', 'D5'],
                    'E': ['E', 'E1', 'E2', 'E3', 'E4', 'E5']},
                   index=[0, 1, 2, 3, 4, 5])

df1
>>  B   C   D   E   User
0   B   C   D   E   user1
1   B1  C1  D1  E1  user1
2   B2  C2  D2  E2  user2
3   B3  C3  D3  E3  user3
4   B4  C4  D4  E4  user2
5   B5  C5  D5  E5  user3

userList = list(df1.User.unique())
userList
> ['user1', 'user2', 'user3']

Then I loop over the list of unique users and sample one row per user, saving the rows to a separate dataframe:

usersSample = pd.DataFrame()  # empty dataframe to collect the samples
for i in userList:
    # DataFrame.append was removed in pandas 2.0; this line needs an older pandas
    usersSample = usersSample.append(df1[df1.User == i].sample(1))

> usersSample   
B   C   D   E   User
0   B   C   D   E   user1
4   B4  C4  D4  E4  user2
3   B3  C3  D3  E3  user3

Is there a more efficient way to achieve this? I'd really like to: 1) avoid appending to the usersSample dataframe, since a gradually growing object seriously hurts performance, and 2) avoid looping over users one at a time. Is there a way to sample one row per user more efficiently?

asked Jul 15 '16 07:07

Ruslan

1 Answer

This is what you want:

df1.groupby('User').apply(lambda df: df.sample(1))
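
A self-contained sketch of the call above, using the question's df1 trimmed to two columns. The random_state=0 argument is an assumption added here only so that repeated runs pick the same rows; drop it for a fresh draw each time. Newer pandas releases warn about, or change, how apply handles the grouping column, so the sketch only relies on the group keys that end up in the index.

```python
import pandas as pd

df1 = pd.DataFrame({
    'User': ['user1', 'user1', 'user2', 'user3', 'user2', 'user3'],
    'B': ['B', 'B1', 'B2', 'B3', 'B4', 'B5'],
})

# One randomly chosen row per user. With the default group_keys=True the
# group key becomes the outer level of a two-level index.
sampled = df1.groupby('User').apply(lambda g: g.sample(1, random_state=0))

print(sampled)
print(sampled.index.get_level_values(0).tolist())
```

The outer index level holds the user, and the inner level holds the row's original index label from df1.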

Without the extra index:

df1.groupby('User', group_keys=False).apply(lambda df: df.sample(1))
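
Two alternatives that avoid apply altogether, offered as a sketch rather than as part of the original answer: sampling directly on the groupby object (DataFrameGroupBy.sample, available from pandas 1.1), and shuffling the whole frame before dropping duplicate users. The random_state values are assumptions added for reproducibility.

```python
import pandas as pd

df1 = pd.DataFrame({
    'User': ['user1', 'user1', 'user2', 'user3', 'user2', 'user3'],
    'B': ['B', 'B1', 'B2', 'B3', 'B4', 'B5'],
    'C': ['C', 'C1', 'C2', 'C3', 'C4', 'C5'],
})

# Option 1: sample on the groupby object itself (pandas >= 1.1).
by_group = df1.groupby('User').sample(n=1, random_state=0)

# Option 2: shuffle all rows, then keep the first row seen for each user.
by_shuffle = df1.sample(frac=1, random_state=0).drop_duplicates('User')

print(by_group)
print(by_shuffle.sort_values('User'))
```

Both options keep the original row labels and all columns, so no extra index level needs to be removed afterwards.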

answered Nov 15 '22 15:11

piRSquared

Tags: python, pandas, dataframe, group-by, pandas-groupby
