
Mean of values in a column for unique values in another column

I am using Python 2.7 (Anaconda) for processing tabular data. I have loaded a text file with two columns, e.g.

[[ 1.  8.]
 [ 2.  4.]
 [ 3.  1.]
 [ 4.  5.]
 [ 5.  6.]
 [ 1.  9.]
 [ 2.  0.]
 [ 3.  7.]
 [ 4.  3.]
 [ 5.  2.]]

My goal is to calculate the mean over all values in the second column that share the same value in the first column: e.g. the mean for 1 would be 8.5, for 2 it would be 2, and for 3 it would be 4. First, I filtered out the unique values in the first column by extracting that column and applying np.unique(), resulting in the array "unique". I wrote a loop that works when I hard-code the unique value:

values = []
for i in range(len(first)):
    if first[i] == 1:          # hard-coded unique value
        values.append(second[i])
print(np.mean(values))
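(For reference, first, second and unique are extracted from the loaded array roughly like this, assuming the two-column array is called data:)

```python
import numpy as np

# Two-column array as loaded from the text file
data = np.array([[1., 8.], [2., 4.], [3., 1.], [4., 5.], [5., 6.],
                 [1., 9.], [2., 0.], [3., 7.], [4., 3.], [5., 2.]])

first = data[:, 0]         # grouping column
second = data[:, 1]        # values to average
unique = np.unique(first)  # the distinct values of the first column
```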

where first and second are the specific columns. Now I want to make this not so specific. I tried

mean = 0
values = []
means = []

for i in unique:
    for k in range(len(first)):
        if first[k] == i:
            values.append(second[k])
            mean = np.mean(values)
            means.append(mean)
    mean = 0
    values = []
print(means)

but it only returns the original second column. Does anybody have an idea how to make this code general? In reality I have about 70k rows, so I cannot do it manually.

Maurus asked Sep 09 '16



2 Answers

In pandas, you can achieve this by using groupby:

In [97]: data
Out[97]: 
array([[ 1.,  8.],
       [ 2.,  4.],
       [ 3.,  1.],
       [ 4.,  5.],
       [ 5.,  6.],
       [ 1.,  9.],
       [ 2.,  0.],
       [ 3.,  7.],
       [ 4.,  3.],
       [ 5.,  2.]])

In [98]: import pandas as pd

In [99]: df = pd.DataFrame(data, columns=['first', 'second'])

In [100]: df.groupby('first').mean().reset_index()
Out[100]: 
   first  second
0    1.0     8.5
1    2.0     2.0
2    3.0     4.0
3    4.0     4.0
4    5.0     4.0
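If you want the group means as a plain mapping of unique value to mean rather than a DataFrame, one option (using the same df as above) is to select the column before grouping and convert the resulting Series:

```python
import numpy as np
import pandas as pd

data = np.array([[1., 8.], [2., 4.], [3., 1.], [4., 5.], [5., 6.],
                 [1., 9.], [2., 0.], [3., 7.], [4., 3.], [5., 2.]])
df = pd.DataFrame(data, columns=['first', 'second'])

# groupby('first')['second'].mean() yields a Series indexed by the unique values
result = df.groupby('first')['second'].mean().to_dict()
# {1.0: 8.5, 2.0: 2.0, 3.0: 4.0, 4.0: 4.0, 5.0: 4.0}
```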
Nehal J Wani answered Nov 02 '22


Write a comparison checking the first column against a unique value, then use the result as a boolean index (mask):

>>> mask = a[:,0] == 1
>>> a[mask]
array([[ 1.,  8.],
       [ 1.,  9.]])

for n in np.unique(a[:,0]):
    mask = a[:,0] == n
    print(np.mean(a[mask], axis = 0))

>>> 
[ 1.   8.5]
[ 2.  2.]
[ 3.  4.]
[ 4.  4.]
[ 5.  4.]
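With ~70k rows the loop over unique values still works fine, but one possible fully vectorized alternative combines np.unique(..., return_inverse=True) with np.bincount to avoid the Python loop entirely:

```python
import numpy as np

a = np.array([[1., 8.], [2., 4.], [3., 1.], [4., 5.], [5., 6.],
              [1., 9.], [2., 0.], [3., 7.], [4., 3.], [5., 2.]])

# inverse[i] is the position of a[i, 0] within uniq
uniq, inverse = np.unique(a[:, 0], return_inverse=True)

# Per-group sums of the second column, divided by per-group counts
sums = np.bincount(inverse, weights=a[:, 1])
counts = np.bincount(inverse)
means = sums / counts
# uniq  -> [1. 2. 3. 4. 5.]
# means -> [8.5 2.  4.  4.  4. ]
```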

If your data file looks something like this

'''
1.,  8.
2.,  4.
3.,  1.
4.,  5.
'''

and you don't really need a numpy array, just use a dictionary:

import collections

# Collect the second-column values per first-column key
d = collections.defaultdict(list)
with open('file.txt') as f:
    for line in f:
        first, second = map(float, line.strip().split(','))
        d[first].append(second)

# d.iteritems() is Python 2; on Python 3 use d.items()
for first, seconds in d.iteritems():
    print(first, sum(seconds) / len(seconds))
wwii answered Nov 02 '22