I have a dataframe with name(person_name), color(shirt_color) as columns Each person wears a shirt with a certain color on a particular day (number of days can be arbitrary)
eg input:
name color
----------------
John White
John White
John Blue
John Blue
John White
Tom Blue
Tom Blue
Tom Green
Tom Black
Jerry Black
Jerry Blue
Jerry Black
I need to find the best colored shirt worn by each person, according to best_color_order eg result:
name color
-------------
Jerry Blue
John White
Tom Green
I am doing the following :
best_color_order = ['White', 'Green', 'Blue', 'Black']
best_color_list = [[name, list(group['color']).sort(key=best_color_order.index)[0]]
for name, group in df.groupby(by='name', sort=False, as_index=False)]
best_color_df = pd.DataFrame(best_color_list, columns=['name', 'color'])
Is there a faster way to do this if I have millions of records?
Convert the column color to an ordered categorical so it will sort in your desired order, then sort the values by color, and finally groupby and take the first value for each name:
best_color_order = ['White', 'Green', 'Blue', 'Black']
df['color'] = pd.Categorical(df['color'], categories = best_color_order, ordered=True)
df.sort_values('color').groupby('name').first()
color
name
Jerry Blue
John White
Tom Green
[EDIT]: A faster way might be to do the same, but instead of groupby, just drop the duplicate names and keep the first (which is the default for the function drop_duplicates):
best_color_order = ['White', 'Green', 'Blue', 'Black']
df['color'] = pd.Categorical(df['color'], categories = best_color_order, ordered=True)
df.sort_values('color').drop_duplicates('name')
name color
0 John White
7 Tom Green
10 Jerry Blue
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With