
Pandas Sum of Duplicate Attributes


I'm using Pandas to manipulate a CSV file with several rows and columns that looks like the following:

Fullname       Amount   Date        Zip     State        .....
John Joe       1        1/10/1900   55555   Confusion
Betty White    5        .           .       Alaska
Bruce Wayne    10       .           .       Frustration
John Joe       20       .           .       .
Betty White    25       .           .       .

I'd like to create a new column named Total containing the sum of Amount for each person (identified by Fullname and Zip). I'm having difficulty finding the correct solution.

Let's just call my csv import csvfile. Here is what I have.

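import pandas
# Read the CSV, using the first row as the column names
df = pandas.read_csv('csvfile.csv', header=0)
# sort_values returns a new DataFrame, so keep the result
df = df.sort_values(['Fullname'])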

I think I have to use iterrows to do what I want. The problem with dropping duplicates is that I would lose the amounts, which may differ between rows for the same person.

user2723240 asked Apr 11 '15 21:04



People also ask

How can I count duplicate values in pandas?

You can count duplicate rows by counting the True values in the pandas.Series returned by duplicated(). The number of True values can be counted with the sum() method. To count the False values (the number of non-duplicate rows), invert the Series with the negation operator ~ and then count True with sum().
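As a minimal sketch of that counting approach (the DataFrame contents here are made up for illustration, and passing subset is optional):

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    'Fullname': ['John Joe', 'Betty White', 'John Joe'],
    'Zip': [55555, 11111, 55555],
})

# duplicated() marks each repeat of an earlier row as True
# (the first occurrence stays False by default).
dup_mask = df.duplicated(subset=['Fullname', 'Zip'])

n_duplicates = dup_mask.sum()      # number of True values
n_unique = (~dup_mask).sum()       # number of False values

print(n_duplicates, n_unique)
```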

How do I sum the same row in pandas?

To sum the values in each row of a DataFrame, call sum() with axis=1. With axis=1 the values are added across the columns, giving one total per row.

What does sum () do in pandas?

The DataFrame sum() method adds all values in each column and returns the sum for each column. By specifying the column axis (axis='columns'), sum() instead adds across the columns and returns the sum of each row.
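A small sketch of the two directions, using a made-up two-column DataFrame (the column names a and b are placeholders):

```python
import pandas as pd

# Hypothetical numeric data for illustration only.
df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

# Default (axis=0): one sum per column.
col_sums = df.sum()

# axis='columns' (same as axis=1): one sum per row.
row_sums = df.sum(axis='columns')

print(col_sums)
print(row_sums)
```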

How do you get only duplicate records in Python?

duplicated() is used to find duplicate rows in a pandas DataFrame, based on all columns or a selected subset. A row is a duplicate when its values repeat those of another row in every column considered. Indexing the DataFrame with the result, e.g. df[df.duplicated()], selects the duplicate rows.
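For instance, a rough sketch of selecting every row that belongs to a duplicated (Fullname, Zip) pair might look like this. The data is invented for illustration, and keep=False is one possible choice that flags all members of each duplicated group rather than only the repeats:

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    'Fullname': ['John Joe', 'Betty White', 'Bruce Wayne', 'John Joe'],
    'Zip': [55555, 11111, 22222, 55555],
    'Amount': [1, 5, 10, 20],
})

# keep=False marks every row of a duplicated group as True,
# including the first occurrence.
mask = df.duplicated(subset=['Fullname', 'Zip'], keep=False)
dups = df[mask]

print(dups)
```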


1 Answer

I think you want this:

df['Total'] = df.groupby(['Fullname', 'Zip'])['Amount'].transform('sum') 

groupby groups by the Fullname and Zip columns, as you've stated. We then call transform on the Amount column and compute the total by passing in the string 'sum'. This returns a Series whose index is aligned to the original df, so it can be assigned directly as a new column. You can then drop the duplicates afterwards, e.g.:

new_df = df.drop_duplicates(subset=['Fullname', 'Zip']) 
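To see the two steps together, here is a small self-contained sketch. The sample rows are modeled on the question, but the Zip and State values the question left as "." are filled in with made-up placeholders, and the CSV is read from an in-memory string rather than csvfile.csv:

```python
import io
import pandas as pd

# Hypothetical sample data; Zip/State values other than the first
# row's are invented for illustration.
csv_text = """Fullname,Amount,Zip,State
John Joe,1,55555,Confusion
Betty White,5,11111,Alaska
Bruce Wayne,10,22222,Frustration
John Joe,20,55555,Confusion
Betty White,25,11111,Alaska
"""

df = pd.read_csv(io.StringIO(csv_text))

# transform('sum') returns one value per original row, so the result
# can be assigned straight back as a new column.
df['Total'] = df.groupby(['Fullname', 'Zip'])['Amount'].transform('sum')

# Keep the first row per person; Total already holds the group sum,
# while Amount keeps only that first row's value.
new_df = df.drop_duplicates(subset=['Fullname', 'Zip'])

print(new_df[['Fullname', 'Zip', 'Total']])
```

Note that after drop_duplicates the Amount column only reflects the first row kept for each person, so Total is the column to rely on. If only the totals are needed, df.groupby(['Fullname', 'Zip'], as_index=False)['Amount'].sum() should give one row per person directly.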
EdChum answered Sep 18 '22 21:09
