Pandas: calculating the mean values of duplicate entries in a dataframe

I have been working with a dataframe in Python and pandas that contains duplicate entries in the first column. The dataframe looks something like this:

        sample_id    qual    percent
    0   sample_1      10        20
    1   sample_2      20        30
    2   sample_1      50        60
    3   sample_2      10        90
    4   sample_3      100       20

I want to write something that identifies duplicate entries in the first column and, for each one, calculates the mean of the values in the remaining columns. An ideal output would look something like this:

        sample_id    qual    percent
    0   sample_1      30        40
    1   sample_2      15        60
    2   sample_3      100       20

I have been struggling with this problem all afternoon and would appreciate any help.

David Ross asked Oct 07 '16 14:10

1 Answer

Group by the sample_id column and take the mean of each group:

    df.groupby('sample_id').mean().reset_index()

or

    df.groupby('sample_id', as_index=False).mean()

Both return one row per sample_id, with qual and percent averaged across the duplicate rows.
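For anyone who wants to try this end to end, here is a minimal sketch that rebuilds the dataframe from the question and applies both forms of the groupby. The variable names `result` and `alt` are only illustrative and are not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "sample_id": ["sample_1", "sample_2", "sample_1", "sample_2", "sample_3"],
    "qual": [10, 20, 50, 10, 100],
    "percent": [20, 30, 60, 90, 20],
})

# Group rows that share a sample_id and average the remaining columns.
# as_index=False keeps sample_id as a regular column instead of the index.
result = df.groupby("sample_id", as_index=False).mean()

# Equivalent form: sample_id becomes the index first, then reset_index()
# turns it back into a column.
alt = df.groupby("sample_id").mean().reset_index()

print(result)
```

The means come back as floats, so sample_1 ends up with qual 30.0 and percent 40.0. If the real dataframe also has non-numeric columns besides sample_id, recent pandas versions will likely need `mean(numeric_only=True)`, or an explicit column selection before calling `mean()`.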

piRSquared answered Sep 21 '22 09:09