pandas groupby two columns and summarize by mean

Tags:

pandas

I have a data frame like this:

import numpy as np
import pandas as pd

df = pd.DataFrame()
df['id'] = [1,1,1,2,2,3,3,3,3,4,4,5]
df['view'] = ['A', 'B', 'A', 'A','B', 'A', 'B', 'A', 'A','B', 'A', 'B']
df['value'] = np.random.random(12)


    id view     value
0    1    A  0.625781
1    1    B  0.330084
2    1    A  0.024532
3    2    A  0.154651
4    2    B  0.196960
5    3    A  0.393941
6    3    B  0.607217
7    3    A  0.422823
8    3    A  0.994323
9    4    B  0.366650
10   4    A  0.649585
11   5    B  0.513923

I now want to summarize 'value' by its mean for each combination of id and view. Some ids have repeated observations for the same view, and I want to collapse them into a single row. For example, id 1 has two observations for A.

I tried

res = df.groupby(['id', 'view'])['value'].mean()

This is almost what I want, but pandas combines the id and view columns into one, which I do not want.

id  view
1   A       0.325157
    B       0.330084
2   A       0.154651
    B       0.196960
3   A       0.603696
    B       0.607217
4   A       0.649585
    B       0.366650
5   B       0.513923

Also, res.shape is (9,).

My desired output would be this:

id  view    value
1   A       0.325157
1   B       0.330084
2   A       0.154651
2   B       0.196960
3   A       0.603696
3   B       0.607217
4   A       0.649585
4   B       0.366650
5   B       0.513923

where the column names and dimensions are kept and the id is repeated on every row. Each id should have only one row for A and one row for B.

How can I achieve this?

713

asked Feb 03 '17 10:02

spore234

1 Answer

You need reset_index, or the parameter as_index=False in groupby, because the result has a MultiIndex. By default the higher levels of the index are sparsified to make the console output a bit easier on the eyes:

import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame()
df['id'] = [1,1,1,2,2,3,3,3,3,4,4,5]
df['view'] = ['A', 'B', 'A', 'A','B', 'A', 'B', 'A', 'A','B', 'A', 'B']
df['value'] = np.random.random(12)
print(df)
    id view     value
0    1    A  0.543405
1    1    B  0.278369
2    1    A  0.424518
3    2    A  0.844776
4    2    B  0.004719
5    3    A  0.121569
6    3    B  0.670749
7    3    A  0.825853
8    3    A  0.136707
9    4    B  0.575093
10   4    A  0.891322
11   5    B  0.209202

res = df.groupby(['id', 'view'])['value'].mean().reset_index()
print(res)
   id view     value
0   1    A  0.483961
1   1    B  0.278369
2   2    A  0.844776
3   2    B  0.004719
4   3    A  0.361376
5   3    B  0.670749
6   4    A  0.891322
7   4    B  0.575093
8   5    B  0.209202

res = df.groupby(['id', 'view'], as_index=False)['value'].mean()
print(res)
   id view     value
0   1    A  0.483961
1   1    B  0.278369
2   2    A  0.844776
3   2    B  0.004719
4   3    A  0.361376
5   3    B  0.670749
6   4    A  0.891322
7   4    B  0.575093
8   5    B  0.209202
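
To see why the plain groupby result looks "combined", here is a minimal sketch on a small hand-made frame (the values are illustrative and chosen so the means are exact; they are not the random data above). It checks that the grouped Series carries a MultiIndex and that both approaches return an ordinary three-column DataFrame:

```python
import pandas as pd

# Small illustrative frame; values are made up for this sketch
df = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2],
    "view":  ["A", "B", "A", "A", "B"],
    "value": [1.0, 5.0, 3.0, 4.0, 6.0],
})

# Plain groupby: a Series indexed by a two-level MultiIndex (id, view)
s = df.groupby(["id", "view"])["value"].mean()
print(type(s.index).__name__)  # MultiIndex
print(s.shape)                 # (4,)

# Option 1: move the index levels back into ordinary columns
res1 = s.reset_index()

# Option 2: tell groupby not to use the keys as the index
res2 = df.groupby(["id", "view"], as_index=False)["value"].mean()

print(res1.shape)              # (4, 3)
print(res1.equals(res2))       # compare the two results
print(res1)
```

The id values are never actually merged with view; the blank cells in the printed Series are only how pandas displays repeated outer index labels.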

200

answered Nov 26 '22 15:11

jezrael
