I have been working on formatting a log file and finally I have arrived at the following dataframe sample, where the categories and the numbers I want to add up are in the same column:
df = pd.DataFrame(dict(a=['Cat. A',1,1,3,'Cat. A',2,2,'Cat. B',3,5,2,6,'Cat. B',1,'Cat. C',4]))
>>> df
         a
0   Cat. A
1        1
2        1
3        3
4   Cat. A
5        2
6        2
7   Cat. B
8        3
9        5
10       2
11       6
12  Cat. B
13       1
14  Cat. C
15       4
If I sum all the numbers below each category, I want to obtain:
Cat. A= 1+1+3+2+2 = 9
Cat. B= 3+5+2+6+1 = 17
Cat. C= 4
I know how to do it by looping through the rows in the classic way, but I would like to know how to do it in a more pythonic way, considering that the number of rows for each category can vary, and that the number of times a category appears in the dataframe can differ too.
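For reference, the classic loop I mean looks roughly like this (a minimal sketch; totals and current are just illustrative names):

import pandas as pd

df = pd.DataFrame(dict(a=['Cat. A',1,1,3,'Cat. A',2,2,'Cat. B',3,5,2,6,'Cat. B',1,'Cat. C',4]))

totals = {}
current = None
for value in df['a']:
    if isinstance(value, str):          # a label row starts (or resumes) a category
        current = value
    else:                               # a number row adds to the current category
        totals[current] = totals.get(current, 0) + value

print(totals)   # {'Cat. A': 9, 'Cat. B': 17, 'Cat. C': 4}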
Here is another way:
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(a=['Cat. A',1,1,3,'Cat. A',2,2,'Cat. B',3,5,2,6,'Cat. B',1,'Cat. C',4]))

def coerce(x):
    # Return NaN for numeric values, keep the category labels as-is
    try:
        int(x)
        return np.nan
    except (ValueError, TypeError):
        return x

def safesum(x):
    # Drop the label rows (equal to the group's first element) and sum the rest
    return x[x != x.iloc[0]].astype(int).sum()

df['b'] = df['a'].apply(coerce).ffill()  # forward-fill each label over its block
df.groupby('b').agg(safesum)
Produces
         a
b
Cat. A   9
Cat. B  17
Cat. C   4
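If you prefer a plain Series over a one-column DataFrame, you can select the column before aggregating:

df.groupby('b')['a'].agg(safesum)

b
Cat. A     9
Cat. B    17
Cat. C     4
Name: a, dtype: int64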
We can use pd.to_numeric with errors='coerce' to turn the column into numbers, leaving NaN for the non-numeric fields. Series.mask with Series.notna then blanks out the numeric rows, so after a forward fill only the category labels remain to group on. Finally, GroupBy.sum adds everything up:
a = pd.to_numeric(df['a'], errors='coerce')   # numbers, NaN where the row is a label
g = df['a'].mask(a.notna()).ffill()           # keep only the labels, fill them downwards
a.groupby(g).sum()
a
Cat. A     9.0
Cat. B    17.0
Cat. C     4.0
Name: a, dtype: float64
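The totals come back as floats because of the NaN placeholders introduced by errors='coerce'. If you want integers, you can cast the result at the end (safe here, since every group sums to a whole number):

a.groupby(g).sum().astype(int)

a
Cat. A     9
Cat. B    17
Cat. C     4
Name: a, dtype: int64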