I have been working with a pandas DataFrame in Python that contains duplicate entries in the first column. The DataFrame looks something like this:
  sample_id  qual  percent
0  sample_1    10       20
1  sample_2    20       30
2  sample_1    50       60
3  sample_2    10       90
4  sample_3   100       20
I want to write something that identifies duplicate entries within the first column and calculates the mean values of the subsequent columns. An ideal output would be something similar to the following:
  sample_id  qual  percent
0  sample_1    30       40
1  sample_2    15       60
2  sample_3   100       20
I have been struggling with this problem all afternoon and would appreciate any help.
You can count duplicate rows by counting the True values in the pandas.Series returned by duplicated(). Because True counts as 1, the sum() method gives the number of duplicates. To count the False values instead (the number of non-duplicate rows), invert the Series with the negation operator ~ and then call sum().
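As a minimal sketch of that counting idea, assuming the question's data is loaded into a DataFrame named df (the variable names below are illustrative, not from the original post):

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "sample_id": ["sample_1", "sample_2", "sample_1", "sample_2", "sample_3"],
    "qual": [10, 20, 50, 10, 100],
    "percent": [20, 30, 60, 90, 20],
})

# True for every occurrence of a sample_id after its first one
dup_mask = df["sample_id"].duplicated()

# True counts as 1, so sum() gives the number of flagged rows
n_duplicates = dup_mask.sum()

# Invert the mask with ~ to count the rows that were not flagged
n_not_flagged = (~dup_mask).sum()

print(n_duplicates, n_not_flagged)
```

Note that with the default keep='first', the first occurrence of each value is not flagged, so the inverted count includes one row per distinct sample_id.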
To get the average (mean) of the columns in a pandas DataFrame, use either the mean() or the describe() method. DataFrame.mean() returns the mean of the values along the requested axis.
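A short sketch of both approaches on the question's data, assuming a DataFrame named df; numeric_only=True is passed so that the string column sample_id is skipped rather than causing an error:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "sample_id": ["sample_1", "sample_2", "sample_1", "sample_2", "sample_3"],
    "qual": [10, 20, 50, 10, 100],
    "percent": [20, 30, 60, 90, 20],
})

# Mean of each numeric column (axis=0 is the default)
col_means = df.mean(numeric_only=True)

# Mean across the numeric columns of each row
row_means = df[["qual", "percent"]].mean(axis=1)

# describe() reports the same column means in its "mean" row
described_means = df.describe().loc["mean"]

print(col_means)
```

These are means over the whole table; they do not yet average within each sample_id.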
The pandas DataFrame.duplicated() method flags duplicate rows in a DataFrame, comparing either all columns or a selected subset of them. A row counts as a duplicate when its values in the compared columns match those of another row. The method returns a boolean Series that you can use to select those rows.
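For illustration, a sketch that selects every row whose sample_id appears more than once, assuming the question's data in a DataFrame named df. The subset and keep parameters are part of DataFrame.duplicated(); the variable names are made up for this example:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "sample_id": ["sample_1", "sample_2", "sample_1", "sample_2", "sample_3"],
    "qual": [10, 20, 50, 10, 100],
    "percent": [20, 30, 60, 90, 20],
})

# Compare only the sample_id column; keep=False flags every member
# of a duplicated group, including the first occurrence
mask = df.duplicated(subset=["sample_id"], keep=False)

# Boolean indexing keeps only the flagged rows
dup_rows = df[mask]

print(dup_rows)
```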
Use groupby on the sample_id column and then take the mean:
    df.groupby('sample_id').mean().reset_index()
or
    df.groupby('sample_id', as_index=False).mean()
Either one gets you a single row per sample_id with the mean of the remaining columns.
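Put together with the data from the question, the groupby approach can be sketched as follows. The construction of df and the variable names are assumptions for the sake of a runnable example:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "sample_id": ["sample_1", "sample_2", "sample_1", "sample_2", "sample_3"],
    "qual": [10, 20, 50, 10, 100],
    "percent": [20, 30, 60, 90, 20],
})

# Group rows sharing a sample_id and average the other columns;
# as_index=False keeps sample_id as a regular column
result = df.groupby("sample_id", as_index=False).mean()

# Equivalent form: group first, then move the index back into a column
via_reset = df.groupby("sample_id").mean().reset_index()

print(result)
```

The averaged columns come back as floats (for example 30.0 rather than 30), since a mean is not generally an integer.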