
Averaging the values in a dictionary based on the key

I am new to Python and I have a set of values like the following:

(3, '655')
(3, '645')
(3, '641')
(4, '602')
(4, '674')
(4, '620')

This is generated from a CSV file with the following code (Python 2.6):

import csv
import time

with open('file.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        date = time.strptime(row[3], "%a %b %d %H:%M:%S %Z %Y")
        data = date, row[5]

        month = data[0][1]
        avg = data[1]
        monthAvg = month, avg
        print monthAvg

What I would like to do is get an average of the values based on the keys:

(3, 647)
(4, 632)

My initial thought was to create a new dictionary.

loop through the original dictionary
    if the key does not exist
        add the key and value to the new dictionary
    else
        sum the value to the existing value in the new dictionary

I'd also have to keep a count of the number of keys so I could produce the average. Seems like a lot of work though - I wasn't sure if there was a more elegant way to accomplish this.

Thank you.

JamesE asked Feb 10 '23 11:02



2 Answers

You can use collections.defaultdict to create a dictionary with unique keys and lists of values:

>>> l=[(3, '655'),(3, '645'),(3, '641'),(4, '602'),(4, '674'),(4, '620')]
>>> from collections import defaultdict
>>> d=defaultdict(list)
>>> 
>>> for i,j in l:
...    d[i].append(int(j))
... 
>>> d
defaultdict(<type 'list'>, {3: [655, 645, 641], 4: [602, 674, 620]})

Then use a list comprehension to create the expected pairs:

>>> [(i,sum(j)/len(j)) for i,j in d.items()]
[(3, 647), (4, 632)]
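
The integer results above rely on Python 2, where / between two ints floors the result. Here is a minimal sketch of the same idea for Python 3 (an assumption on my part, since the question uses 2.6). average_by_key is a made-up helper name, not something from the standard library:

```python
from collections import defaultdict


def average_by_key(pairs):
    """Group (key, value) pairs by key and return {key: mean of values}."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(int(value))  # values arrive as strings
    # True division: the result is a float even when it divides evenly.
    return {key: sum(values) / len(values) for key, values in groups.items()}


pairs = [(3, '655'), (3, '645'), (3, '641'), (4, '602'), (4, '674'), (4, '620')]
print(average_by_key(pairs))  # {3: 647.0, 4: 632.0}
```

Use // instead of / in the return line if you want floored integer averages as in the Python 2 output.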

And within your code you can do:

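import csv
import time
from collections import defaultdict

d = defaultdict(list)  # month -> list of values

with open('file.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        date = time.strptime(row[3], "%a %b %d %H:%M:%S %Z %Y")
        data = date, row[5]

        month = data[0][1]
        avg = data[1]
        d[month].append(int(avg))

# Print once, after the whole file has been read
print [(i,sum(j)/len(j)) for i,j in d.items()]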
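
If you would rather not keep every value in memory, the sum-and-count idea from the question also works. This is a rough sketch, written for Python 3 and assuming the column positions and date format from the question's code. The name monthly_averages and the sample rows are made up for illustration:

```python
import csv
import io
import time


def monthly_averages(rows, date_col=3, value_col=5,
                     date_format="%a %b %d %H:%M:%S %Z %Y"):
    """Return {month: mean}, keeping only a running sum and count per month."""
    totals = {}  # month -> [running sum, count]
    for row in rows:
        month = time.strptime(row[date_col], date_format).tm_mon
        entry = totals.setdefault(month, [0, 0])
        entry[0] += int(row[value_col])
        entry[1] += 1
    return {month: total / count for month, (total, count) in totals.items()}


# Made-up rows standing in for file.csv; only columns 3 and 5 matter here.
sample = io.StringIO(
    "a,b,c,Tue Mar 03 10:00:00 UTC 2015,e,655\n"
    "a,b,c,Wed Mar 04 10:00:00 UTC 2015,e,645\n"
    "a,b,c,Thu Mar 05 10:00:00 UTC 2015,e,641\n"
    "a,b,c,Wed Apr 01 10:00:00 UTC 2015,e,602\n"
    "a,b,c,Thu Apr 02 10:00:00 UTC 2015,e,674\n"
    "a,b,c,Fri Apr 03 10:00:00 UTC 2015,e,620\n"
)
averages = monthly_averages(csv.reader(sample))
print(sorted(averages.items()))  # [(3, 647.0), (4, 632.0)]
```

With a real file, pass a csv.reader over open('file.csv', newline='') in place of the StringIO sample.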

Mazdak answered Feb 13 '23 02:02



Use pandas. It is designed for exactly this kind of grouped calculation, so what you want to do can be written as a one-liner. It is also likely to be noticeably faster than the pure-Python approaches when there are a lot of values.

import pandas as pd

a=[(3, '655'),
   (3, '645'),
   (3, '641'),
   (4, '602'),
   (4, '674'),
   (4, '620')]

res = pd.DataFrame(a).astype('float').groupby(0).mean()
print(res)

Gives:

     1
0     
3  647
4  632

Here is a multi-line version, showing what happens:

df = pd.DataFrame(a)  # construct a structure containing data
df = df.astype('float')  # convert data to float values
grp = df.groupby(0)  # group the values by the value in the first column
df = grp.mean()  # take the mean of each group

Further, if you want to use a csv file, it is even easier since you don't need to parse the csv file yourself (I use made-up names for the columns I don't know):

import pandas as pd
# read_csv takes the column labels through names=; header=None means the file has no header row
df = pd.read_csv('file.csv', names=['col0', 'col1', 'col2', 'date', 'col4', 'data'], header=None)
df['month'] = pd.DatetimeIndex(df['date']).month
df = df[['month', 'data']].groupby('month').mean()

TheBlackCat answered Feb 13 '23 02:02
