I work with pandas DataFrames, and I have a large DataFrame containing users and their data. Each user can have multiple rows. I want to sample one row per user. My current solution seems inefficient:
import pandas as pd

df1 = pd.DataFrame({'User': ['user1', 'user1', 'user2', 'user3', 'user2', 'user3'],
                    'B': ['B', 'B1', 'B2', 'B3', 'B4', 'B5'],
                    'C': ['C', 'C1', 'C2', 'C3', 'C4', 'C5'],
                    'D': ['D', 'D1', 'D2', 'D3', 'D4', 'D5'],
                    'E': ['E', 'E1', 'E2', 'E3', 'E4', 'E5']},
                   index=[0, 1, 2, 3, 4, 5])
df1
>> B C D E User
0 B C D E user1
1 B1 C1 D1 E1 user1
2 B2 C2 D2 E2 user2
3 B3 C3 D3 E3 user3
4 B4 C4 D4 E4 user2
5 B5 C5 D5 E5 user3
userList = list(df1.User.unique())
userList
> ['user1', 'user2', 'user3']
Then I loop over the list of unique users and sample one row per user, saving the samples to a different dataframe:
usersSample = pd.DataFrame()  # empty dataframe, to save samples
for i in userList:
    # note: DataFrame.append was removed in pandas 2.0 (pd.concat replaces it there)
    usersSample = usersSample.append(df1[df1.User == i].sample(1))
> usersSample
B C D E User
0 B C D E user1
4 B4 C4 D4 E4 user2
3 B3 C3 D3 E3 user3
Is there a more efficient way of achieving that? I'd really like to: 1) avoid appending to the dataframe usersSample, since it is a gradually growing object and that seriously hurts performance, and 2) avoid looping over users one at a time. Is there a way to sample one row per user more efficiently?
You can use the following syntax to select unique rows across specific columns in a pandas DataFrame: df = df.drop_duplicates(subset=['col1', 'col2', ...])
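As a rough sketch of how drop_duplicates could apply to the question above: shuffle all rows once, then keep the first row that appears for each user. This is one possible approach, not necessarily the fastest. The reduced two-column frame and the fixed random_state are assumptions made only to keep the example short and repeatable.

```python
import pandas as pd

# Reduced copy of the question's data (only two columns, for brevity).
df1 = pd.DataFrame({
    'User': ['user1', 'user1', 'user2', 'user3', 'user2', 'user3'],
    'B': ['B', 'B1', 'B2', 'B3', 'B4', 'B5'],
})

# sample(frac=1) returns every row in random order;
# drop_duplicates then keeps the first row it meets for each user.
# random_state is fixed here only so the sketch is repeatable.
sample = df1.sample(frac=1, random_state=0).drop_duplicates(subset=['User'])

print(sample.sort_index())
```

Because the shuffle is uniform, each row of a given user has the same chance of being the one that is kept, and the original index labels are preserved.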
You can use the nunique() function to count the number of unique values in a pandas DataFrame or Series.
You can get the unique values of a column with Series.unique(). To collect unique values across multiple columns, flatten those columns first and pass the result to the top-level pandas.unique(), which expects a one-dimensional array-like.
Series.unique() returns the unique values of a Series in order of appearance. It is hash table-based, so it does NOT sort. The unique values are returned as a NumPy array.
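A small sketch of these calls on a cut-down version of the question's data may help. The reduced frame and the variable names users, n_users and values_bc are illustrative assumptions, not part of the original post.

```python
import pandas as pd

# Reduced copy of the question's data (illustrative only).
df1 = pd.DataFrame({
    'User': ['user1', 'user1', 'user2', 'user3', 'user2', 'user3'],
    'B': ['B', 'B1', 'B2', 'B3', 'B4', 'B5'],
    'C': ['C', 'C1', 'C2', 'C3', 'C4', 'C5'],
})

# Distinct users, in order of first appearance (not sorted).
users = df1['User'].unique()

# Number of distinct users.
n_users = df1['User'].nunique()

# Unique values across two columns: flatten to 1-D first,
# because pandas.unique() expects a one-dimensional input.
values_bc = pd.unique(df1[['B', 'C']].to_numpy().ravel())

print(list(users), n_users, len(values_bc))
```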
This is what you want:
df1.groupby('User').apply(lambda df: df.sample(1))
To avoid the extra 'User' index level that groupby adds, pass group_keys=False:
df1.groupby('User', group_keys=False).apply(lambda df: df.sample(1))
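If you are on pandas 1.1 or later, the groupby object also has its own sample method, which avoids both the lambda and the extra index level. A minimal sketch, assuming the df1 from the question; the name users_sample and the fixed random_state are illustrative choices only:

```python
import pandas as pd

df1 = pd.DataFrame({'User': ['user1', 'user1', 'user2', 'user3', 'user2', 'user3'],
                    'B': ['B', 'B1', 'B2', 'B3', 'B4', 'B5'],
                    'C': ['C', 'C1', 'C2', 'C3', 'C4', 'C5'],
                    'D': ['D', 'D1', 'D2', 'D3', 'D4', 'D5'],
                    'E': ['E', 'E1', 'E2', 'E3', 'E4', 'E5']},
                   index=[0, 1, 2, 3, 4, 5])

# One random row per user, with no explicit loop and no apply().
# Requires pandas >= 1.1; random_state is fixed only for repeatability.
users_sample = df1.groupby('User').sample(n=1, random_state=42)

print(users_sample)
```

The result keeps the original row labels, so it can be matched back to df1 by index if needed.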