
Mean of values in a column for unique values in another column

I am using Python 2.7 (Anaconda) for processing tabular data. I have loaded a text file with two columns, e.g.

[[ 1.  8.]
 [ 2.  4.]
 [ 3.  1.]
 [ 4.  5.]
 [ 5.  6.]
 [ 1.  9.]
 [ 2.  0.]
 [ 3.  7.]
 [ 4.  3.]
 [ 5.  2.]]

My goal is to calculate the mean over all values in the second column that share the same value in the first column: e.g. the mean for 1 would be 8.5, for 2 it would be 2, and for 3 it would be 4. First, I filtered out the unique values in the first column by extracting that column and applying np.unique(), resulting in the array "unique". I wrote a loop that works when I hard-code the unique value:

values = []
for i in range(len(first)):
    if first[i] == 1:          # hard-coded unique value
        values.append(second[i])
print(np.mean(values))
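(For reference, first, second and unique are extracted from the loaded array roughly like this, assuming the two-column array is called data:)

```python
import numpy as np

# Two-column array as loaded from the text file
data = np.array([[1., 8.], [2., 4.], [3., 1.], [4., 5.], [5., 6.],
                 [1., 9.], [2., 0.], [3., 7.], [4., 3.], [5., 2.]])

first = data[:, 0]         # grouping column
second = data[:, 1]        # values to average
unique = np.unique(first)  # the distinct values of the first column
```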

where first and second are the specific columns. Now I want to make this not so specific. I tried

mean = 0
values = []
means = []

for i in unique:
    for k in range(len(first)):
        if first[k] == i:
            values.append(second[k])
            mean = np.mean(values)
            means.append(mean)
    mean = 0
    values = []
print(means)

but it only returns the original second column. Does anybody have an idea how to make this code general? In reality I have about 70k rows, so I cannot do it manually.

Maurus asked Sep 09 '16



2 Answers

In pandas, you can achieve this by using groupby:

In [97]: data
Out[97]: 
array([[ 1.,  8.],
       [ 2.,  4.],
       [ 3.,  1.],
       [ 4.,  5.],
       [ 5.,  6.],
       [ 1.,  9.],
       [ 2.,  0.],
       [ 3.,  7.],
       [ 4.,  3.],
       [ 5.,  2.]])

In [98]: import pandas as pd

In [99]: df = pd.DataFrame(data, columns=['first', 'second'])

In [100]: df.groupby('first').mean().reset_index()
Out[100]: 
   first  second
0    1.0     8.5
1    2.0     2.0
2    3.0     4.0
3    4.0     4.0
4    5.0     4.0
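If you want the group means as a plain mapping of unique value to mean rather than a DataFrame, one option (using the same df as above) is to select the column before grouping and convert the resulting Series:

```python
import numpy as np
import pandas as pd

data = np.array([[1., 8.], [2., 4.], [3., 1.], [4., 5.], [5., 6.],
                 [1., 9.], [2., 0.], [3., 7.], [4., 3.], [5., 2.]])
df = pd.DataFrame(data, columns=['first', 'second'])

# groupby('first')['second'].mean() yields a Series indexed by the unique values
result = df.groupby('first')['second'].mean().to_dict()
# {1.0: 8.5, 2.0: 2.0, 3.0: 4.0, 4.0: 4.0, 5.0: 4.0}
```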
Nehal J Wani answered Nov 02 '22


Write a comparison checking the first column against a unique value, then use the result as a boolean index (mask):

>>> mask = a[:,0] == 1
>>> a[mask]
array([[ 1.,  8.],
       [ 1.,  9.]])

for n in np.unique(a[:,0]):
    mask = a[:,0] == n
    print(np.mean(a[mask], axis = 0))

>>> 
[ 1.   8.5]
[ 2.  2.]
[ 3.  4.]
[ 4.  4.]
[ 5.  4.]
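With ~70k rows the loop over unique values still works fine, but one possible fully vectorized alternative combines np.unique(..., return_inverse=True) with np.bincount to avoid the Python loop entirely:

```python
import numpy as np

a = np.array([[1., 8.], [2., 4.], [3., 1.], [4., 5.], [5., 6.],
              [1., 9.], [2., 0.], [3., 7.], [4., 3.], [5., 2.]])

# inverse[i] is the position of a[i, 0] within uniq
uniq, inverse = np.unique(a[:, 0], return_inverse=True)

# Per-group sums of the second column, divided by per-group counts
sums = np.bincount(inverse, weights=a[:, 1])
counts = np.bincount(inverse)
means = sums / counts
# uniq  -> [1. 2. 3. 4. 5.]
# means -> [8.5 2.  4.  4.  4. ]
```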

If your data file looks something like this

'''
1.,  8.
2.,  4.
3.,  1.
4.,  5.
'''

and you don't really need a numpy array, just use a dictionary:

import collections

# Collect the second-column values per first-column key
d = collections.defaultdict(list)
with open('file.txt') as f:
    for line in f:
        first, second = map(float, line.strip().split(','))
        d[first].append(second)

# d.iteritems() is Python 2; on Python 3 use d.items()
for first, seconds in d.iteritems():
    print(first, sum(seconds) / len(seconds))
wwii answered Nov 02 '22