I am new to Python and I have a set of values like the following:
(3, '655')
(3, '645')
(3, '641')
(4, '602')
(4, '674')
(4, '620')
This is generated from a CSV file with the following code (python 2.6):
import csv
import time
with open('file.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        date = time.strptime(row[3], "%a %b %d %H:%M:%S %Z %Y")
        data = date, row[5]
        month = data[0][1]
        avg = data[1]
        monthAvg = month, avg
        print monthAvg
What I would like to do is get an average of the values based on the keys:
(3, 647)
(4, 632)
My initial thought was to create a new dictionary.
loop through the original dictionary
    if the key does not exist
        add the key and value to the new dictionary
    else
        sum the value to the existing value in the new dictionary
I'd also have to keep a count of the values for each key so I could produce the average. That seems like a lot of work, though, and I wasn't sure if there was a more elegant way to accomplish this.
Thank you.
You can use collections.defaultdict to create a dictionary with unique keys and lists of values:
>>> l=[(3, '655'),(3, '645'),(3, '641'),(4, '602'),(4, '674'),(4, '620')]
>>> from collections import defaultdict
>>> d=defaultdict(list)
>>>
>>> for i,j in l:
...     d[i].append(int(j))
...
>>> d
defaultdict(<type 'list'>, {3: [655, 645, 641], 4: [602, 674, 620]})
Then use a list comprehension to create the expected pairs:
>>> [(i,sum(j)/len(j)) for i,j in d.items()]
[(3, 647), (4, 632)]
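The session above is Python 2, where / on two integers truncates. As a minimal sketch assuming Python 3 (the names pairs, groups and averages are only illustrative), the same grouping needs // to keep the integer averages, because / would return 647.0 and 632.0 instead:

```python
from collections import defaultdict

pairs = [(3, '655'), (3, '645'), (3, '641'), (4, '602'), (4, '674'), (4, '620')]

# Group the integer values under their key.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(int(value))

# // is floor division; plain / would produce floats in Python 3.
averages = [(key, sum(values) // len(values)) for key, values in sorted(groups.items())]
print(averages)
```

Sorting the items is optional; it just makes the order of the pairs predictable.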
And within your code you can do:
import csv
import time
from collections import defaultdict

d = defaultdict(list)
with open('file.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        date = time.strptime(row[3], "%a %b %d %H:%M:%S %Z %Y")
        data = date, row[5]
        month = data[0][1]
        avg = data[1]
        d[month].append(int(avg))
print [(i,sum(j)/len(j)) for i,j in d.items()]
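For trying this out without the real file, here is a rough Python 3 sketch of the same loop. The sample rows and the function name monthly_averages are made up for illustration; only the column positions (3 for the date, 5 for the value) and the date format come from the question.

```python
import csv
import io
import time
from collections import defaultdict

# Hypothetical sample rows; only columns 3 (date) and 5 (value) are used.
SAMPLE = """\
a,b,c,Mon Mar 02 10:15:00 UTC 2015,e,655
a,b,c,Tue Mar 03 10:15:00 UTC 2015,e,645
a,b,c,Wed Apr 01 10:15:00 UTC 2015,e,602
a,b,c,Thu Apr 02 10:15:00 UTC 2015,e,674
"""

def monthly_averages(csvfile):
    """Return (month, average) pairs for the rows of an open CSV file."""
    by_month = defaultdict(list)
    for row in csv.reader(csvfile):
        date = time.strptime(row[3], "%a %b %d %H:%M:%S %Z %Y")
        by_month[date.tm_mon].append(int(row[5]))
    # In Python 3, / is true division, so the averages are floats.
    return [(month, sum(vals) / len(vals)) for month, vals in sorted(by_month.items())]

result = monthly_averages(io.StringIO(SAMPLE))
print(result)
```

With a real file in Python 3 you would pass open('file.csv', newline='') instead of the StringIO object, since the csv module expects text mode there rather than 'rb'.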
Use pandas. It is designed specifically for this kind of task, so you can express it in very little code (what you want to do is a one-liner). It will also typically be much faster than the pure-Python approaches when given a lot of values.
import pandas as pd
a=[(3, '655'),
(3, '645'),
(3, '641'),
(4, '602'),
(4, '674'),
(4, '620')]
res = pd.DataFrame(a).astype('float').groupby(0).mean()
print(res)
Gives:
     1
0
3  647
4  632
Here is a multi-line version, showing what happens:
df = pd.DataFrame(a) # construct a structure containing data
df = df.astype('float') # convert data to float values
grp = df.groupby(0) # group the values by the value in the first column
df = grp.mean() # take the mean of each group
Further, if you want to use a csv file, it is even easier since you don't need to parse the csv file yourself (I use made-up names for the columns I don't know):
import pandas as pd
df = pd.read_csv('file.csv', names=['col0', 'col1', 'col2', 'date', 'col4', 'data'], header=None)
df['month'] = pd.DatetimeIndex(df['date']).month
df = df[['month', 'data']].groupby('month').mean()