
Performing grouped average and standard deviation with NumPy arrays

I have a set of data (X, Y). The values of my independent variable X are not unique, so many of them are repeated. I want to output new arrays containing: X_unique, the unique values of X; Y_mean, the mean of all Y values corresponding to each value in X_unique; and Y_std, the standard deviation of all Y values corresponding to each value in X_unique.

x = data[:,0]
y = data[:,1]
obtmind asked Jan 05 '16 17:01


People also ask

How do you find the mean and standard deviation of a NumPy array?

The standard deviation is the square root of the average of the squared deviations from the mean, i.e., std = sqrt(mean(x)), where x = abs(a - a.mean())**2. The average squared deviation is typically calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead.
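The formula above can be checked by hand. The following is a minimal sketch, assuming a small made-up array `a` (not from the question), that computes the standard deviation step by step and compares it with `np.std`, both with the default divisor N and with `ddof=1`:

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
a = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Squared deviations from the mean
x = np.abs(a - a.mean()) ** 2
N = len(x)

# Divisor N: matches np.std(a), whose default is ddof=0
std_manual = np.sqrt(x.sum() / N)

# Divisor N - ddof with ddof=1: matches np.std(a, ddof=1)
std_manual_ddof1 = np.sqrt(x.sum() / (N - 1))

print(std_manual, np.std(a))
print(std_manual_ddof1, np.std(a, ddof=1))
```

With `ddof=0` the result describes the spread of the data at hand; `ddof=1` is the usual choice when the data is a sample from a larger population.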

How do you average an array in NumPy?

The arithmetic mean is the sum of the elements along an axis divided by the number of elements. The numpy.mean() function returns the arithmetic mean of the elements in the array.
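As a small illustration of what "along an axis" means, here is a sketch using a made-up 2x3 array `m` (an assumption for the example, not data from the question):

```python
import numpy as np

# Hypothetical 2x3 array used only for illustration
m = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

overall = np.mean(m)             # mean of all six elements: 3.5
col_means = np.mean(m, axis=0)   # one mean per column: [2.5, 3.5, 4.5]
row_means = np.mean(m, axis=1)   # one mean per row: [2.0, 5.0]

print(overall, col_means, row_means)
```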

How do you use std with NumPy?

NumPy provides the function numpy.std(), which computes the standard deviation of the array elements along the specified axis. The standard deviation is the square root of the average squared deviation from the mean.

How do you average multiple NumPy arrays?

Finding the average of several NumPy arrays is similar to finding the average of several numbers: sum the corresponding array elements, then divide that sum by the total number of arrays.
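A short sketch of that element-wise average, assuming three made-up arrays of equal shape (`a`, `b` and `c` are illustrative names only). Stacking the arrays and taking the mean along the new axis should give the same result as summing and dividing:

```python
import numpy as np

# Hypothetical arrays of equal shape
a = np.array([1.0, 2.0, 3.0])
b = np.array([3.0, 4.0, 5.0])
c = np.array([5.0, 6.0, 10.0])

# Sum corresponding elements, then divide by the number of arrays
avg_manual = (a + b + c) / 3

# Equivalent: stack into a 3x3 array and average along axis 0
avg_stacked = np.mean(np.stack([a, b, c]), axis=0)

print(avg_manual)
print(avg_stacked)
```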


2 Answers

You can use binned_statistic from scipy.stats, which applies a chosen statistic to chunks of a 1D array. To define the chunks, sort the data by the first column and find the positions where its value changes; np.unique with return_index=True gives exactly those positions. Putting it all together, here's an implementation:

import numpy as np
from scipy.stats import binned_statistic as bstat

# Sort data corresponding to argsort of first column
sdata = data[data[:,0].argsort()]

# Unique col-1 elements and the index at which each group starts in the sorted data
unq_x,breaks = np.unique(sdata[:,0],return_index=True)
breaks = np.append(breaks,data.shape[0])

# Use binned statistic to get grouped average and std deviation values
idx_range = np.arange(data.shape[0])
avg_y,_,_ = bstat(x=idx_range, values=sdata[:,1], statistic='mean', bins=breaks)
std_y,_,_ = bstat(x=idx_range, values=sdata[:,1], statistic='std', bins=breaks)

From the docs of binned_statistic, one can also use a custom statistic function:

function : a user-defined function which takes a 1D array of values, and outputs a single numerical statistic. This function will be called on the values in each bin. Empty bins will be represented by function([]), or NaN if this returns an error.

Sample input and output:

In [121]: data
Out[121]: 
array([[2, 5],
       [2, 2],
       [1, 5],
       [3, 8],
       [0, 8],
       [6, 7],
       [8, 1],
       [2, 5],
       [6, 8],
       [1, 8]])

In [122]: np.column_stack((unq_x,avg_y,std_y))
Out[122]: 
array([[ 0.        ,  8.        ,  0.        ],
       [ 1.        ,  6.5       ,  1.5       ],
       [ 2.        ,  4.        ,  1.41421356],
       [ 3.        ,  8.        ,  0.        ],
       [ 6.        ,  7.5       ,  0.5       ],
       [ 8.        ,  1.        ,  0.        ]])
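As a sketch of the custom-statistic option quoted from the docs above, the same sorting and binning steps can be reused with a user-defined function. The helper `value_range` below is a hypothetical name introduced for illustration; it returns the max minus the min of the Y values in each group, applied to the sample data shown above:

```python
import numpy as np
from scipy.stats import binned_statistic

data = np.array([[2, 5], [2, 2], [1, 5], [3, 8], [0, 8],
                 [6, 7], [8, 1], [2, 5], [6, 8], [1, 8]])

# Sort rows by the first column so equal X values are contiguous
sdata = data[data[:, 0].argsort()]

# Unique X values and the start index of each group, plus a closing edge
unq_x, breaks = np.unique(sdata[:, 0], return_index=True)
breaks = np.append(breaks, data.shape[0])

def value_range(values):
    # Hypothetical custom statistic: spread between largest and smallest value
    return np.max(values) - np.min(values)

# Bin the row positions so each bin covers exactly one group of equal X
idx_range = np.arange(data.shape[0])
range_y, _, _ = binned_statistic(x=idx_range, values=sdata[:, 1],
                                 statistic=value_range, bins=breaks)

print(np.column_stack((unq_x, range_y)))
```

Any callable that maps a 1D array to a single number can be passed the same way.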
Divakar answered Nov 15 '22 00:11


x_unique = np.unique(x)
y_means = np.array([np.mean(y[x==u]) for u in x_unique])
y_stds = np.array([np.std(y[x==u]) for u in x_unique])
Peter answered Nov 14 '22 23:11
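The list comprehensions above scan the whole array once per unique value, which may become slow when there are many groups. One possible loop-free variant, which is a different technique from either answer, uses np.unique with return_inverse=True together with np.bincount. This is only a sketch; it assumes 1D `x` and `y` and the population standard deviation (ddof=0), matching the default of np.std:

```python
import numpy as np

data = np.array([[2, 5], [2, 2], [1, 5], [3, 8], [0, 8],
                 [6, 7], [8, 1], [2, 5], [6, 8], [1, 8]], dtype=float)
x = data[:, 0]
y = data[:, 1]

# inverse[i] is the index into x_unique of x[i]; counts holds the group sizes
x_unique, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)

# Per-group sum of y divided by group size gives the grouped mean
y_means = np.bincount(inverse, weights=y) / counts

# Per-group mean of squared deviations, then square root (ddof=0)
sq_dev = (y - y_means[inverse]) ** 2
y_stds = np.sqrt(np.bincount(inverse, weights=sq_dev) / counts)

print(np.column_stack((x_unique, y_means, y_stds)))
```

For the sample data this should reproduce the means and standard deviations shown in the first answer's output.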
