I am using Python 2.7 (Anaconda) to process tabular data. I have loaded a text file with two columns, e.g.
[[ 1. 8.]
[ 2. 4.]
[ 3. 1.]
[ 4. 5.]
[ 5. 6.]
[ 1. 9.]
[ 2. 0.]
[ 3. 7.]
[ 4. 3.]
[ 5. 2.]]
my goal is to calculate the mean of all values in the second column that share the same value in the first column, e.g. the mean for 1 would be 8.5, for 2 it would be 2, for 3 it would be 4. First, I extracted the first column and applied np.unique() to it, giving me the array "unique". I wrote a loop that works when I hard-code the unique value:
mean = 0
values = []
for i in range(0, len(first), 1):
    if first[i] == 1:
        values.append(second[i])
print(np.mean(values))
where first and second are the extracted columns. Now I want to generalize this. I tried
mean = 0
values = []
means = []
for i in unique:
    for k in range(0, len(first), 1):
        if first[k] == i:
            values.append(second[k])
            mean = np.mean(values)
            means.append(mean)
            mean = 0
            values = []
print(means)
but it only returns the original second column. Does anybody have an idea how to make this code general? In reality I have about 70k rows, so I cannot do it manually.
In pandas, you can achieve this by using groupby:
In [97]: data
Out[97]:
array([[ 1., 8.],
[ 2., 4.],
[ 3., 1.],
[ 4., 5.],
[ 5., 6.],
[ 1., 9.],
[ 2., 0.],
[ 3., 7.],
[ 4., 3.],
[ 5., 2.]])
In [98]: import pandas as pd
In [99]: df = pd.DataFrame(data, columns=['first', 'second'])
In [100]: df.groupby('first').mean().reset_index()
Out[100]:
first second
0 1.0 8.5
1 2.0 2.0
2 3.0 4.0
3 4.0 4.0
4 5.0 4.0
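If you want the per-group means back as plain NumPy arrays rather than a DataFrame, a minimal sketch (assuming the same data array as above):

```python
import numpy as np
import pandas as pd

data = np.array([[1., 8.], [2., 4.], [3., 1.], [4., 5.], [5., 6.],
                 [1., 9.], [2., 0.], [3., 7.], [4., 3.], [5., 2.]])

df = pd.DataFrame(data, columns=['first', 'second'])

# groupby('first') groups rows by the key column;
# ['second'].mean() averages the value column within each group.
means = df.groupby('first')['second'].mean()

keys = means.index.values   # the unique values of the first column
vals = means.values         # the corresponding means: 8.5, 2.0, 4.0, 4.0, 4.0
```

The result is a Series indexed by the unique keys, so `means[1.0]` directly gives the mean for key 1.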
Write a comparison against the first column for one unique value and use the resulting boolean array as an index:
>>> mask = a[:,0] == 1
>>> a[mask]
array([[ 1., 8.],
[ 1., 9.]])
for n in np.unique(a[:,0]):
    mask = a[:,0] == n
    print(np.mean(a[mask], axis=0))
>>>
[ 1. 8.5]
[ 2. 2.]
[ 3. 4.]
[ 4. 4.]
[ 5. 4.]
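To collect the per-key means into an array instead of printing them, the same masking idea can be wrapped in a list comprehension (a sketch, assuming `a` is the 10x2 array from the question):

```python
import numpy as np

a = np.array([[1., 8.], [2., 4.], [3., 1.], [4., 5.], [5., 6.],
              [1., 9.], [2., 0.], [3., 7.], [4., 3.], [5., 2.]])

# One row [key, mean] per unique value in the first column;
# np.mean(..., axis=0) averages both columns of the masked rows,
# and the key column averages to the key itself.
result = np.array([np.mean(a[a[:, 0] == n], axis=0)
                   for n in np.unique(a[:, 0])])
```

This gives a 5x2 array whose first column holds the unique keys and whose second column holds the means.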
If your data file looks something like this
'''
1., 8.
2., 4.
3., 1.
4., 5.
'''
and you don't really need a numpy array, just use a dictionary:
import collections
d = collections.defaultdict(list)
with open('file.txt') as f:
    for line in f:
        line = line.strip()
        first, second = map(float, line.split(','))
        d[first].append(second)
# iteritems() is Python 2; use d.items() on Python 3
for first, second in d.iteritems():
    print(first, sum(second) / len(second))
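The same grouping idea works on in-memory data too; a Python 3 compatible sketch (using `items()` rather than the Python 2 `iteritems()`), with the question's values as a list of pairs:

```python
import collections

rows = [(1., 8.), (2., 4.), (3., 1.), (4., 5.), (5., 6.),
        (1., 9.), (2., 0.), (3., 7.), (4., 3.), (5., 2.)]

# Group second-column values by first-column key.
d = collections.defaultdict(list)
for first, second in rows:
    d[first].append(second)

# Mean per key; sorted() gives a deterministic key order.
means = {k: sum(v) / len(v) for k, v in d.items()}
for k in sorted(means):
    print(k, means[k])
```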