
Calculate mean using numpy ndarray

The text file looks like this:

david weight_2005 50
david weight_2012 60
david height_2005 150
david height_2012 160
mark weight_2005 90
mark weight_2012 85
mark height_2005 160
mark height_2012 170

How can I calculate the mean weight and mean height for david and mark, as follows:

david>> mean(weight_2005 and weight_2012), mean (height_2005 and height_2012)
mark>> mean(weight_2005 and weight_2012), mean (height_2005 and height_2012)

My incomplete code is:

import numpy as np
import csv

with open('data.txt', 'r') as infile:
    contents = csv.reader(infile, delimiter=' ')
    c1, c2, c3 = zip(*contents)
    data = np.array(c3, dtype=float)

How do I then apply np.mean?

2964502 asked Nov 12 '13 16:11


3 Answers

The mean function is for computing the average of an array of numbers. You will need to come up with a way to select the values of c3 by applying a condition to c2.
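One way to apply such a condition is a boolean mask. The following is only a minimal sketch: the rows are copied inline from the question instead of being read from data.txt, and the variable names (names, attrs, values, mask) are illustrative rather than taken from the original code.

```python
import numpy as np

# Rows copied from the question: (name, attribute, value).
rows = [
    ("david", "weight_2005", "50"),
    ("david", "weight_2012", "60"),
    ("david", "height_2005", "150"),
    ("david", "height_2012", "160"),
    ("mark", "weight_2005", "90"),
    ("mark", "weight_2012", "85"),
    ("mark", "height_2005", "160"),
    ("mark", "height_2012", "170"),
]

# Same unpacking as in the question, but keeping all three columns.
c1, c2, c3 = zip(*rows)
names = np.array(c1)
attrs = np.array(c2)
values = np.array(c3, dtype=float)

# True only for david's rows whose attribute starts with "weight".
mask = (names == "david") & np.char.startswith(attrs, "weight")

david_mean_weight = values[mask].mean()
print(david_mean_weight)  # 55.0
```

The mask keeps the selection logic in numpy, at the cost of writing one condition per person and attribute.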

What would probably suit your needs better is splitting the data into a hierarchical structure. I prefer using dictionaries for this, something like

import csv

data = {}
with open('data.txt') as f:
    contents = csv.reader(f, delimiter=' ')
    # The loop must run inside the with block: csv.reader reads lazily from f
    for name, attribute, value in contents:
        data[name] = data.get(name, {})  # Default value is a new dict
        attr_name, attr_year = attribute.split('_')
        attr_year = int(attr_year)
        data[name][attr_name] = data[name].get(attr_name, {})
        data[name][attr_name][attr_year] = float(value)  # Store a number, not a string

Now data will look like

{
    "david": {
        "weight": {
            2005: 50,
            2012: 60
        },
        "height": {
            2005: 150,
            2012: 160
        }
    },
    "mark": {
        "weight": {
            2005: 90,
            2012: 85
        },
        "height": {
            2005: 160,
            2012: 170
        }
    }
}

Then what you can do is

# list(...) and .items() keep this working on Python 3
david_avg_weight = np.mean(list(data['david']['weight'].values()))
mark_avg_height = np.mean([v for k, v in data['mark']['height'].items() if k > 2008])

Here I'm still using np.mean, but only calling it on a normal Python list.
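To get every mean the question asks for in one pass, a nested dict comprehension over this structure is one option. This is a sketch only: the data literal below stands in for the dictionary built from the file, and the name means is my own.

```python
import numpy as np

# Stand-in for the nested dictionary built from data.txt.
data = {
    "david": {
        "weight": {2005: 50.0, 2012: 60.0},
        "height": {2005: 150.0, 2012: 160.0},
    },
    "mark": {
        "weight": {2005: 90.0, 2012: 85.0},
        "height": {2005: 160.0, 2012: 170.0},
    },
}

# For each person and attribute, average the values across all years.
means = {
    name: {
        attr: float(np.mean(list(by_year.values())))
        for attr, by_year in attrs.items()
    }
    for name, attrs in data.items()
}

print(means["david"])  # {'weight': 55.0, 'height': 155.0}
print(means["mark"])   # {'weight': 87.5, 'height': 165.0}
```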

bheklilr answered Sep 19 '22 11:09


I'll make this community wiki, because it's more "here's how I think you should do it instead" than "here's the answer to the question you asked". For something like this I'd probably use pandas instead of numpy, as its grouping tools are much better. It'll also be useful to compare with numpy-based approaches.

import pandas as pd
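# A regex separator needs the Python parsing engine; naming it avoids a ParserWarning
df = pd.read_csv("data.txt", sep="[ _]", header=None, engine="python",
                 names=["name", "property", "year", "value"])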
means = df.groupby(["name", "property"])["value"].mean()

.. and, er, that's it.


First, read in the data into a DataFrame, letting either whitespace or _ separate columns:

>>> import pandas as pd
>>> df = pd.read_csv("data.txt", sep="[ _]", header=None, engine="python",
...                  names=["name", "property", "year", "value"])
>>> df
    name property  year  value
0  david   weight  2005     50
1  david   weight  2012     60
2  david   height  2005    150
3  david   height  2012    160
4   mark   weight  2005     90
5   mark   weight  2012     85
6   mark   height  2005    160
7   mark   height  2012    170

Then group by name and property, take the value column, and compute the mean:

>>> means = df.groupby(["name", "property"])["value"].mean()
>>> means
name   property
david  height      155.0
       weight       55.0
mark   height      165.0
       weight       87.5
Name: value, dtype: float64

.. okay, the sep="[ _]" trick is a little too cute for real code, though it works well enough here. In practice I'd use a whitespace separator, read in the second column as property_year and then do

df["property"], df["year"] = zip(*df["property_year"].str.split("_"))
del df["property_year"]

to allow underscores in other columns.
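Put together, that whitespace-separated variant might look like the sketch below. It reads the sample rows from an in-memory string instead of data.txt so that it runs on its own, and it uses str.split with expand=True as an alternative to the zip idiom above; both choices are mine, not part of the original answer.

```python
import io

import pandas as pd

# Sample rows from the question, held in memory instead of data.txt.
raw = "\n".join([
    "david weight_2005 50",
    "david weight_2012 60",
    "david height_2005 150",
    "david height_2012 160",
    "mark weight_2005 90",
    "mark weight_2012 85",
    "mark height_2005 160",
    "mark height_2012 170",
])

# Split on a single space only, so underscores stay inside the second column.
df = pd.read_csv(io.StringIO(raw), sep=" ", header=None,
                 names=["name", "property_year", "value"])

# Split "weight_2005" into "weight" and "2005" at the first underscore.
df[["property", "year"]] = df["property_year"].str.split("_", n=1, expand=True)
df["year"] = df["year"].astype(int)
df = df.drop(columns="property_year")

means = df.groupby(["name", "property"])["value"].mean()
print(means)
```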

DSM answered Sep 18 '22 11:09


You can read your data directly into a numpy structured array with np.genfromtxt (np.recfromcsv did the same job, but it is no longer available under that name in NumPy 2.x):

data = np.genfromtxt("data.txt", delimiter=" ", dtype=None, encoding="utf-8", names=['name', 'type', 'value'])

Then you can find the matching indices with np.where:

indices = np.where((data['name'] == 'david') & np.char.startswith(data['type'], 'height'))

and take the mean over those indices:

np.mean(data['value'][indices])
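The same selection can be repeated for every person and property with a small loop. This is a rough sketch, assuming np.genfromtxt as the reader and an in-memory copy of the sample rows in place of data.txt; the helper names prop and means are illustrative.

```python
import io

import numpy as np

# Sample rows from the question, held in memory instead of data.txt.
raw = "\n".join([
    "david weight_2005 50",
    "david weight_2012 60",
    "david height_2005 150",
    "david height_2012 160",
    "mark weight_2005 90",
    "mark weight_2012 85",
    "mark height_2005 160",
    "mark height_2012 170",
])

# dtype=None lets genfromtxt infer a type per column.
data = np.genfromtxt(io.StringIO(raw), delimiter=" ", dtype=None,
                     encoding="utf-8", names=["name", "type", "value"])

# Strip the year suffix: "weight_2005" becomes "weight".
prop = np.array([t.split("_")[0] for t in data["type"]])

# One boolean mask per (name, property) pair.
means = {}
for name in np.unique(data["name"]):
    for p in np.unique(prop):
        mask = (data["name"] == name) & (prop == p)
        means[(str(name), str(p))] = float(data["value"][mask].mean())

print(means[("david", "height")])  # 155.0
print(means[("mark", "weight")])   # 87.5
```

For more than a handful of groups, a grouping tool such as pandas groupby is likely to be simpler than nested loops.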

Nicolas Barbey answered Sep 18 '22 11:09