I have a set of data (X,Y). My independent variable values X are not unique, so there are multiple repeated values, I want to output a new array containing : X_unique, which is a list of unique values of X. Y_mean, the mean of all of the Y values corresponding to X_unique. Y_std, the standard deviation of all the Y values corresponding to X_unique.
x = data[:,0]
y = data[:,1]
The standard deviation is the square root of the average of the squared deviations from the mean, i.e., std = sqrt(mean(x)) , where x = abs(a - a.mean())**2 . The average squared deviation is typically calculated as x.sum() / N , where N = len(x) . If, however, ddof is specified, the divisor N - ddof is used instead.
mean() Arithmetic mean is the sum of elements along an axis divided by the number of elements. The numpy. mean() function returns the arithmetic mean of elements in the array.
The numpy module of Python provides a function called numpy. std(), used to compute the standard deviation along the specified axis. This function returns the standard deviation of the array elements. The square root of the average square deviation (computed from the mean), is known as the standard deviation.
Finding average of NumPy arrays is quite similar to finding average of given numbers. We just have to get the sum of corresponding array elements and then divide that sum with the total number of arrays.
You can use binned_statistic
from scipy.stats that supports various statistic functions to be applied in chunks across a 1D array. To get the chunks, we need to sort and get positions of the shifts (where chunks change), for which np.unique
would be useful. Putting all those, here's an implementation -
from scipy.stats import binned_statistic as bstat
# Sort data corresponding to argsort of first column
sdata = data[data[:,0].argsort()]
# Unique col-1 elements and positions of breaks (elements are not identical)
unq_x,breaks = np.unique(sdata[:,0],return_index=True)
breaks = np.append(breaks,data.shape[0])
# Use binned statistic to get grouped average and std deviation values
idx_range = np.arange(data.shape[0])
avg_y,_,_ = bstat(x=idx_range, values=sdata[:,1], statistic='mean', bins=breaks)
std_y,_,_ = bstat(x=idx_range, values=sdata[:,1], statistic='std', bins=breaks)
From the docs of binned_statistic
, one can also use a custom statistic function :
function : a user-defined function which takes a 1D array of values, and outputs a single numerical statistic. This function will be called on the values in each bin. Empty bins will be represented by function([]), or NaN if this returns an error.
Sample input, output -
In [121]: data
Out[121]:
array([[2, 5],
[2, 2],
[1, 5],
[3, 8],
[0, 8],
[6, 7],
[8, 1],
[2, 5],
[6, 8],
[1, 8]])
In [122]: np.column_stack((unq_x,avg_y,std_y))
Out[122]:
array([[ 0. , 8. , 0. ],
[ 1. , 6.5 , 1.5 ],
[ 2. , 4. , 1.41421356],
[ 3. , 8. , 0. ],
[ 6. , 7.5 , 0.5 ],
[ 8. , 1. , 0. ]])
x_unique = np.unique(x)
y_means = np.array([np.mean(y[x==u]) for u in x_unique])
y_stds = np.array([np.std(y[x==u]) for u in x_unique])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With