I'm using Pandas to manipulate a csv file with several rows and columns that looks like the following
Fullname     Amount  Date       Zip    State        .....
John Joe     1       1/10/1900  55555  Confusion
Betty White  5       .          .      Alaska
Bruce Wayne  10      .          .      Frustration
John Joe     20      .          .      .
Betty White  25      .          .      .
I'd like to create a new column entitled Total with the total sum of Amount for each person (identified by Fullname and Zip). I'm having difficulty finding the correct solution.
Let's just call my csv import csvfile. Here is what I have.
import pandas
df = pandas.read_csv('csvfile.csv', header=0)
df = df.sort_values(['Fullname'])
I think I have to use iterrows to do what I want, treating each row as an object. The problem with dropping duplicates is that I would lose the Amount, or the amounts may differ between rows.
I think you want this:
df['Total'] = df.groupby(['Fullname', 'Zip'])['Amount'].transform('sum')
So groupby will group by the Fullname and Zip columns, as you've stated; we then call transform on the Amount column and calculate the total amount by passing in the string 'sum'. This returns a Series with its index aligned to the original df, so you can then drop the duplicates afterwards, e.g.
new_df = df.drop_duplicates(subset=['Fullname', 'Zip'])
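Putting the two steps together, here is a minimal runnable sketch using a small frame modelled on the sample data in the question (the Zip values and the later Amounts are placeholders in the original, so the ones below are illustrative):

```python
import pandas as pd

# Sample frame mirroring the question's data; zips and later amounts
# are illustrative stand-ins for the "." placeholders in the original.
df = pd.DataFrame({
    'Fullname': ['John Joe', 'Betty White', 'Bruce Wayne',
                 'John Joe', 'Betty White'],
    'Amount':   [1, 5, 10, 20, 25],
    'Zip':      ['55555', '55555', '55555', '55555', '55555'],
})

# transform('sum') computes the per-group total and broadcasts it
# back to every row, so the result aligns with the original index.
df['Total'] = df.groupby(['Fullname', 'Zip'])['Amount'].transform('sum')

# Keep one row per (Fullname, Zip) pair now that each row carries Total.
new_df = df.drop_duplicates(subset=['Fullname', 'Zip'])
print(new_df[['Fullname', 'Zip', 'Total']])
```

Because transform returns a row-aligned Series (unlike agg, which collapses each group to one row), the Total column can be assigned directly before deduplicating.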