I have a data frame like this:
df = pd.DataFrame()
df['id'] = [1,1,1,2,2,3,3,3,3,4,4,5]
df['view'] = ['A', 'B', 'A', 'A','B', 'A', 'B', 'A', 'A','B', 'A', 'B']
df['value'] = np.random.random(12)
id view value
0 1 A 0.625781
1 1 B 0.330084
2 1 A 0.024532
3 2 A 0.154651
4 2 B 0.196960
5 3 A 0.393941
6 3 B 0.607217
7 3 A 0.422823
8 3 A 0.994323
9 4 B 0.366650
10 4 A 0.649585
11 5 B 0.513923
I now want to summarize for each id
each view
by mean of 'value'.
Think of this as some ids have repeated observations for view, and I want to summarize them. For example, id 1 has two observations for A.
I tried
res = df.groupby(['id', 'view'])['value'].mean()
This actually almost what I want, but pandas combines the id
and view
column into one, which I do not want.
id view
1 A 0.325157
B 0.330084
2 A 0.154651
B 0.196960
3 A 0.603696
B 0.607217
4 A 0.649585
B 0.366650
5 B 0.513923
also res.shape is of dimension (9,)
my desired output would be this:
id view value
1 A 0.325157
1 B 0.330084
2 A 0.154651
2 B 0.196960
3 A 0.603696
3 B 0.607217
4 A 0.649585
4 B 0.366650
5 B 0.513923
where the column names and dimensions are kept and where the id is repeated. Each id should have only 1 row for A and B.
How can I achieve this?
Use DataFrame. groupby(). sum() to group rows based on one or multiple columns and calculate sum agg function. groupby() function returns a DataFrameGroupBy object which contains an aggregate function sum() to calculate a sum of a given column for each group.
Grouping by Multiple ColumnsYou can do this by passing a list of column names to groupby instead of a single string value.
To calculate mean values grouped on another column in pandas, we will use groupby, and then we will apply mean() method. Pandas allow us a direct method called mean() which calculates the average of the set passed into it.
Apply a function groupby to each row or column of a DataFrame. Groupby one column and return the mean of the remaining columns in each group. Groupby two columns and return the mean of the remaining column. Groupby one column and return the mean of only particular column in the group.
You need reset_index
or parameter as_index=False
in groupby
, because you get MuliIndex
and by default the higher levels of the indexes are sparsified to make the console output a bit easier on the eyes:
np.random.seed(100)
df = pd.DataFrame()
df['id'] = [1,1,1,2,2,3,3,3,3,4,4,5]
df['view'] = ['A', 'B', 'A', 'A','B', 'A', 'B', 'A', 'A','B', 'A', 'B']
df['value'] = np.random.random(12)
print (df)
id view value
0 1 A 0.543405
1 1 B 0.278369
2 1 A 0.424518
3 2 A 0.844776
4 2 B 0.004719
5 3 A 0.121569
6 3 B 0.670749
7 3 A 0.825853
8 3 A 0.136707
9 4 B 0.575093
10 4 A 0.891322
11 5 B 0.209202
res = df.groupby(['id', 'view'])['value'].mean().reset_index()
print (res)
id view value
0 1 A 0.483961
1 1 B 0.278369
2 2 A 0.844776
3 2 B 0.004719
4 3 A 0.361376
5 3 B 0.670749
6 4 A 0.891322
7 4 B 0.575093
8 5 B 0.209202
res = df.groupby(['id', 'view'], as_index=False)['value'].mean()
print (res)
id view value
0 1 A 0.483961
1 1 B 0.278369
2 2 A 0.844776
3 2 B 0.004719
4 3 A 0.361376
5 3 B 0.670749
6 4 A 0.891322
7 4 B 0.575093
8 5 B 0.209202
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With