Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sampling one record per unique value (pandas, python)

I work with python-pandas dataframes, and I have a large dataframe containing users and their data. Each user can have multiple rows. I want to sample 1-row per user. My current solution seems not efficient:

df1 = pd.DataFrame({'User': ['user1', 'user1', 'user2', 'user3', 'user2', 'user3'],
                 'B': ['B', 'B1', 'B2', 'B3','B4','B5'],
                 'C': ['C', 'C1', 'C2', 'C3','C4','C5'],
                 'D': ['D', 'D1', 'D2', 'D3','D4','D5'],
                 'E': ['E', 'E1', 'E2', 'E3','E4','E5']},
                 index=[0, 1, 2, 3,4,5])

df1
>>  B   C   D   E   User
0   B   C   D   E   user1
1   B1  C1  D1  E1  user1
2   B2  C2  D2  E2  user2
3   B3  C3  D3  E3  user3
4   B4  C4  D4  E4  user2
5   B5  C5  D5  E5  user3

userList = list(df1.User.unique())
userList
> ['user1', 'user2', 'user3']

The I loop over unique users list and sample one row per user, saving them to a different dataframe

usersSample = pd.DataFrame() # empty dataframe, to save samples
for i in userList:
    usersSample=usersSample.append(df1[df1.User == i].sample(1)) 

> usersSample   
B   C   D   E   User
0   B   C   D   E   user1
4   B4  C4  D4  E4  user2
3   B3  C3  D3  E3  user3

Is there a more efficient way of achieving that? I'd really like to: 1) avoid appending to dataframe usersSample. This is gradually growing object and it seriously kills performance. And 2) avoid looping over users one at a time. Is there a way to sample 1-per-user more efficiently?

like image 451
Ruslan Avatar asked Jul 15 '16 07:07

Ruslan


People also ask

How do I extract unique rows in pandas?

And you can use the following syntax to select unique rows across specific columns in a pandas DataFrame: df = df. drop_duplicates(subset=['col1', 'col2', ...])

How do you count unique occurrences in pandas?

You can use the nunique() function to count the number of unique values in a pandas DataFrame.

How do I extract unique values from a column in pandas?

You can get unique values in column (multiple columns) from pandas DataFrame using unique() or Series. unique() functions. unique() from Series is used to get unique values from a single column and the other one is used to get from multiple columns.

What does unique () method in pandas do?

Pandas: Series - unique() function The unique() function is used to get unique values of Series object. Uniques are returned in order of appearance. Hash table-based unique, therefore does NOT sort. The unique values returned as a NumPy array.


1 Answers

This is what you want:

df1.groupby('User').apply(lambda df: df.sample(1))

enter image description here

Without the extra index:

df1.groupby('User', group_keys=False).apply(lambda df: df.sample(1))

enter image description here

like image 75
piRSquared Avatar answered Nov 15 '22 15:11

piRSquared