The text file looks like:
david weight_2005 50
david weight_2012 60
david height_2005 150
david height_2012 160
mark weight_2005 90
mark weight_2012 85
mark height_2005 160
mark height_2012 170
How can I calculate the mean weight and mean height for david and mark, as follows:
david>> mean(weight_2005 and weight_2012), mean (height_2005 and height_2012)
mark>> mean(weight_2005 and weight_2012), mean (height_2005 and height_2012)
My incomplete code is:
import numpy as np
import csv
with open('data.txt', 'r') as infile:
    contents = csv.reader(infile, delimiter=' ')
    c1, c2, c3 = zip(*contents)
data = np.array(c3, dtype=float)
Then how do I apply np.mean?
The mean function computes the average of an array of numbers. You will need to come up with a way to select the values of c3 by applying a condition to c2.
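One way to do that selection is a boolean mask built from the name and label columns. The following is only a sketch: the helper mean_for and the inline SAMPLE string (standing in for data.txt) are illustrative names, not part of the question's code.

```python
import csv
import io

import numpy as np

# Inline copy of the sample data so the sketch runs without data.txt.
SAMPLE = """david weight_2005 50
david weight_2012 60
david height_2005 150
david height_2012 160
mark weight_2005 90
mark weight_2012 85
mark height_2005 160
mark height_2012 170
"""


def mean_for(names, labels, values, name, prefix):
    """Mean of the values whose name matches and whose label starts with prefix."""
    mask = (names == name) & np.char.startswith(labels, prefix)
    return values[mask].mean()


rows = list(csv.reader(io.StringIO(SAMPLE), delimiter=" "))
c1, c2, c3 = zip(*rows)
names = np.array(c1)                # e.g. "david"
labels = np.array(c2)               # e.g. "weight_2005"
values = np.array(c3, dtype=float)

david_weight = mean_for(names, labels, values, "david", "weight")
mark_height = mean_for(names, labels, values, "mark", "height")
print(david_weight, mark_height)
```

The mask keeps only the rows where both conditions hold, so the mean is taken over exactly the entries for one person and one attribute.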
What would probably suit your needs better is splitting the data into a hierarchical structure; I prefer using dictionaries. Something like:
import csv

data = {}
with open('data.txt') as f:
    contents = csv.reader(f, delimiter=' ')
    for name, attribute, value in contents:
        data[name] = data.get(name, {})  # Default value is a new dict
        attr_name, attr_year = attribute.split('_')
        attr_year = int(attr_year)
        data[name][attr_name] = data[name].get(attr_name, {})
        # csv.reader yields strings, so convert before storing
        data[name][attr_name][attr_year] = float(value)
Now data will look like
{
    "david": {
        "weight": {
            2005: 50.0,
            2012: 60.0
        },
        "height": {
            2005: 150.0,
            2012: 160.0
        }
    },
    "mark": {
        "weight": {
            2005: 90.0,
            2012: 85.0
        },
        "height": {
            2005: 160.0,
            2012: 170.0
        }
    }
}
Then what you can do is
# In Python 3, wrap .values() in list() so np.mean receives a sequence
david_avg_weight = np.mean(list(data['david']['weight'].values()))
# Only average the measurements taken after 2008
mark_avg_height = np.mean([v for k, v in data['mark']['height'].items() if k > 2008])
Here I'm still using np.mean, but only calling it on a normal Python list.
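To get every mean at once rather than one lookup at a time, the nested dictionary can be walked with a comprehension. This is a sketch under a few assumptions: build_nested and all_means are hypothetical helper names, statistics.mean from the standard library stands in for np.mean, and the inline SAMPLE string replaces data.txt.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Inline copy of the sample data so the sketch runs without data.txt.
SAMPLE = """david weight_2005 50
david weight_2012 60
david height_2005 150
david height_2012 160
mark weight_2005 90
mark weight_2012 85
mark height_2005 160
mark height_2012 170
"""


def build_nested(lines):
    """Build {name: {attribute: {year: value}}} from space-separated rows."""
    data = defaultdict(lambda: defaultdict(dict))
    for name, attribute, value in csv.reader(lines, delimiter=" "):
        attr_name, attr_year = attribute.split("_")
        data[name][attr_name][int(attr_year)] = float(value)
    return data


def all_means(data):
    """Average each attribute over all of its years, for every name."""
    return {
        name: {attr: mean(by_year.values()) for attr, by_year in attrs.items()}
        for name, attrs in data.items()
    }


means = all_means(build_nested(io.StringIO(SAMPLE)))
for name, attrs in means.items():
    print(name, attrs)
```

Using defaultdict removes the need for the explicit get calls, at the cost of silently creating entries when a missing key is read.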
I'll make this community wiki, because it's more "here's how I think you should do it instead" than "here's the answer to the question you asked". For something like this I'd probably use pandas instead of numpy, as its grouping tools are much better. It'll also be useful to compare with numpy-based approaches.
import pandas as pd
# A regex separator needs the Python parsing engine
df = pd.read_csv("data.txt", sep="[ _]", engine="python", header=None,
                 names=["name", "property", "year", "value"])
means = df.groupby(["name", "property"])["value"].mean()
.. and, er, that's it.
First, read the data into a DataFrame, letting either a space or _ separate columns:
>>> import pandas as pd
>>> df = pd.read_csv("data.txt", sep="[ _]", engine="python", header=None,
...                  names=["name", "property", "year", "value"])
>>> df
    name property  year  value
0  david   weight  2005     50
1  david   weight  2012     60
2  david   height  2005    150
3  david   height  2012    160
4   mark   weight  2005     90
5   mark   weight  2012     85
6   mark   height  2005    160
7   mark   height  2012    170
Then group by name and property, take the value column, and compute the mean:
>>> means = df.groupby(["name", "property"])["value"].mean()
>>> means
name   property
david  height      155.0
       weight       55.0
mark   height      165.0
       weight       87.5
Name: value, dtype: float64
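If one row per person is easier to read, the grouped result can be pivoted with unstack so that each property becomes a column. This is a sketch only: the inline SAMPLE string stands in for data.txt, and the variable name table is illustrative.

```python
import io

import pandas as pd

# Inline copy of the sample data so the sketch runs without data.txt.
SAMPLE = """david weight_2005 50
david weight_2012 60
david height_2005 150
david height_2012 160
mark weight_2005 90
mark weight_2012 85
mark height_2005 160
mark height_2012 170
"""

df = pd.read_csv(io.StringIO(SAMPLE), sep="[ _]", engine="python", header=None,
                 names=["name", "property", "year", "value"])

# Move the "property" index level into columns: one row per name.
table = df.groupby(["name", "property"])["value"].mean().unstack("property")

for name, row in table.iterrows():
    print(f"{name}>> weight {row['weight']}, height {row['height']}")
```

This matches the layout asked for in the question, with the mean weight and mean height side by side for each name.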
.. okay, the sep="[ _]" trick is a little too cute for real code, though it works well enough here. In practice I'd use a whitespace separator, read in the second column as property_year, and then do
df["property"], df["year"] = zip(*df["property_year"].str.split("_"))
del df["property_year"]
to allow underscores in other columns.
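Put together, the whitespace-separated variant might look like the sketch below. It swaps the zip-based split for str.rsplit with expand=True, which is my own substitution: splitting once from the right also tolerates underscores inside the property name, assuming the year is always the last piece. The inline SAMPLE string stands in for data.txt.

```python
import io

import pandas as pd

# Inline copy of the sample data so the sketch runs without data.txt.
SAMPLE = """david weight_2005 50
david weight_2012 60
david height_2005 150
david height_2012 160
mark weight_2005 90
mark weight_2012 85
mark height_2005 160
mark height_2012 170
"""

df = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+", header=None,
                 names=["name", "property_year", "value"])

# Split "weight_2005" into "weight" and "2005" at the last underscore.
parts = df["property_year"].str.rsplit("_", n=1, expand=True)
df["property"] = parts[0]
df["year"] = parts[1].astype(int)
df = df.drop(columns="property_year")

means = df.groupby(["name", "property"])["value"].mean()
print(means)
```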
You can read your data directly into a NumPy structured array with np.genfromtxt (np.recfromcsv did the same job in older NumPy releases):
data = np.genfromtxt("data.txt", delimiter=" ", dtype=None, encoding="utf-8",
                     names=['name', 'type', 'value'])
then you can find the appropriate indices with np.where:
indices = np.where((data['name'] == 'david') & np.char.startswith(data['type'], 'height'))
and compute the mean over those indices:
np.mean(data['value'][indices])