Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python grouping and getting the Average, Minimum and Maximum values

Tags:

python

pandas

csv

I have a csv data set that looks like this:

Class,  Code,   Vendor, State,  NumberOfDays
3,      123,    Name1,  NE,     12.58402778
1,      876,    Name2,  TX,     12.51041667
3,      123,    Name1,  NE,     2.354166667
1,      876,    Name2,  TX,     12.21111111
3,      456,    Name2,  NY,     6.346527778
2,      876,    Name1,  NY,     5.513194444
3,      123,    Name1,  NE,     5.38125
1,      876,    Name2,  TX,     5.409722222

I have code the following code:

df = pd.read_csv(r'C:\Python36\Data\testing\LowHighMean.csv')
df2 = df.groupby(['Class','Code','Vendor','State'])['NumberOfDays'].mean().apply(lambda x: '{:.2f}'.format(x))
df2.to_csv(r'C:\Python36\Data\testing\output.csv')

Which works great in getting me the average 'NumberOfDays' by grouping the other fields:

1,876,Name2,TX,10.04
2,876,Name1,NY,5.51
3,123,Name1,NE,6.77
3,456,Name2,NY,6.35

I cannot seem to carry over the headers but it is not a big deal I just put the headers in on another step. The issue I am trying to solve for is to add columns which would provide the lowest min() and highest max() value. I am looking to create this:

Class,  Code,   Vendor, State,  AverageDays, LowestNumberOfDays,    HighestNumberOfDays
1,      876,    Name2,  TX,     10.04,       5.41                   12.51             
2,      876,    Name1,  NY,     5.51,        5.51                   5.51
3,      123,    Name1,  NE,     6.77,        2.35                   12.58
3,      456,    Name2,  NY,     6.35,        6.35                   6.35   

The size of the starting data files are over 3 gig and over 30 million records. The file size becomes much smaller after the transformation. Due to the starting file size I am trying to figure out a way where I can avoid doing this in four different steps. 3 separate steps/runs to get mean(), max() and min() and then a forth run to combine them. Since I am a noob I don't even know how to do this with out setting up 4 sets of code and running the file 4 individual times.

like image 828
John Minze Avatar asked Apr 27 '26 14:04

John Minze


1 Answers

Use aggregation by agg, then is necessary rename columns:

d = {'mean':'AverageDays','min':'LowestNumberOfDays','max':'HighestNumberOfDays'}
df = (df.groupby(['Class','Code','Vendor','State'])['NumberOfDays']
        .agg(['mean','min','max'])
        .rename(columns=d)
        .reset_index())
print (df)
   Class  Code Vendor State  AverageDays  LowestNumberOfDays  \
0      1   876  Name2    TX    10.043750            5.409722   
1      2   876  Name1    NY     5.513194            5.513194   
2      3   123  Name1    NE     6.773148            2.354167   
3      3   456  Name2    NY     6.346528            6.346528   

   HighestNumberOfDays  
0            12.510417  
1             5.513194  
2            12.584028  
3             6.346528  

Thanks for alternative solution, Bharath shetty:

df = df.pivot_table(index=['Class','Code','Vendor','State'],
                    value‌​s='NumberOfDays',
                    agg‌​func=('min','mean','‌​max'))
        .rename(column‌​s=d)
        .reset_index() 
like image 107
jezrael Avatar answered Apr 30 '26 03:04

jezrael



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!