NumPy has two different functions for calculating an average:
np.average()
and
np.mean()
Since it is unlikely that NumPy would include a redundant feature, there must be a nuanced difference.
This was a concept I was very unclear on when starting data analysis in Python, so I decided to write a detailed self-answer here, as I am sure others are struggling with it.
Short Answer:
'Mean' and 'Average' are two different things. People use them interchangeably but shouldn't. np.mean() gives you the arithmetic mean, whereas np.average() gives you the arithmetic mean if you pass no other parameters but can also compute a weighted average.
Long Answer and Background:
Statistics:
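Since NumPy is mostly used for working with data sets, it is important to understand the mathematical concept behind this confusion. In simple mathematics and everyday life we use the words 'average' and 'mean' interchangeably, but they are not the same thing: 'average' is a loose term that can refer to several measures (mean, median, mode), while the 'arithmetic mean' is one specific calculation.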
What This Means For NumPy:
Back to the topic at hand. Since NumPy is normally used in applications related to mathematics it needs to be a bit more precise about the difference between Average() and Mean() than tools like Excel which use Average() as a function for finding the 'Arithmetic Mean'.
np.mean()
In NumPy, np.mean() will allow you to calculate the 'Arithmetic Mean' across a specified axis.
Here's how you would use it:
import numpy as np
myArray = np.array([[3, 4], [5, 6]])
np.mean(myArray)  # 4.5
There are also parameters for changing which dtype is used and which axis the function should compute along (the default is the flattened array).
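To make the axis and dtype parameters concrete, here is a minimal sketch using the same 2x2 array. The variable names are only illustrative.

```python
import numpy as np

my_array = np.array([[3, 4], [5, 6]])

# Default: mean of the flattened array, (3 + 4 + 5 + 6) / 4
overall = np.mean(my_array)

# axis=0 averages down the rows, giving one value per column
per_column = np.mean(my_array, axis=0)

# axis=1 averages across the columns, giving one value per row
per_row = np.mean(my_array, axis=1)

# dtype sets the type used for the computation and the result
as_float32 = np.mean(my_array, dtype=np.float32)

print(overall)           # 4.5
print(per_column)        # [4. 5.]
print(per_row)           # [3.5 5.5]
print(as_float32.dtype)  # float32
```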
np.average()
np.average() on the other hand allows you to take a 'Weighted Mean' in which different numbers in your array may have a different weight. For example, in the documentation we can see:
>>> data = np.arange(1, 5)
>>> data
array([1, 2, 3, 4])
>>> np.average(data)
2.5
>>> np.average(range(1,11), weights=range(10,0,-1))
4.0
For the last call, if you were to take a non-weighted average you would expect the answer to be 5.5. However, it ends up being 4 because we applied the weights to it.
If you don't have a good handle on what a 'weighted mean' is, we can try to simplify it:
Consider this a very elementary summary of the 'weighted mean'. It isn't going to be mathematically rigorous (which I hope someone will correct), but it should help you visualize what we're discussing.
A mean is the sum of all the numbers divided by how many numbers there are. This means they all have an equal weight, or are counted once. For our sample, the plain mean is:
(1+2+3+4+5+6+7+8+9+10)/10 = 5.5
A weighted mean counts numbers at different weights. In the example above the weights are whole numbers (10 for the number 1, 9 for the number 2, and so on down to 1 for the number 10), so we can picture each number being repeated as many times as its weight:
(1+1+1+1+1+1+1+1+1+1+2+2+2+2+2+2+2+2+2+3+3+3+3+3+3+3+3+4+4+4+4+4+4+4+5+5+5+5+5+5+6+6+6+6+6+7+7+7+7+8+8+8+9+9+10)/55 = 4
Even though in the actual number set there is only one instance of the number 1, we're counting it at 10 times its normal weight. It also works the other way: we could count a number at 1/3 of its normal weight.
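The "count each number as many times as its weight" picture can be checked directly. This is a small sketch using the values and weights from the documentation example. It computes the weighted mean from its definition, sum(w * x) / sum(w), and again by literally repeating the values with np.repeat.

```python
import numpy as np

values = np.arange(1, 11)       # 1, 2, ..., 10
weights = np.arange(10, 0, -1)  # 10, 9, ..., 1

# Definition of the weighted mean: sum(w * x) / sum(w)
by_hand = (values * weights).sum() / weights.sum()

# The "repetition" picture: ten 1s, nine 2s, ..., one 10
repeated = np.repeat(values, weights)
by_repetition = repeated.mean()

print(len(repeated))                        # 55
print(by_hand)                              # 4.0
print(by_repetition)                        # 4.0
print(np.average(values, weights=weights))  # 4.0
```

The repetition trick only works for whole-number weights. The sum(w * x) / sum(w) form handles fractional weights such as 1/3 as well.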
If you don't provide a weights parameter to np.average(), it will simply give you the equally weighted average across the flattened array, which is equivalent to np.mean().
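A quick way to convince yourself of that equivalence is the following sketch. The sample array is made up for illustration.

```python
import numpy as np

data = np.array([[1.0, 2.0], [3.0, 4.0]])

# With no weights, np.average matches np.mean
plain_average = np.average(data)
plain_mean = np.mean(data)

# Giving every element the same weight changes nothing
equal_weights = np.average(data, weights=np.ones_like(data))

# The same holds along an axis
same_along_axis = np.allclose(np.average(data, axis=0), np.mean(data, axis=0))

print(plain_average)    # 2.5
print(plain_mean)       # 2.5
print(equal_weights)    # 2.5
print(same_along_axis)  # True
```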
Why Would I Ever Use np.mean()?
If np.average() can be used to find the flat arithmetic mean, then you may be asking yourself "why would I ever use np.mean()?" np.mean() allows for a few useful parameters that np.average() does not. One of the key ones is the dtype parameter, which allows you to set the type used in the computation.
For example the NumPy docs give us this case:
Single precision:
>>> a = np.zeros((2, 512*512), dtype=np.float32)
>>> a[0, :] = 1.0
>>> a[1, :] = 0.1
>>> np.mean(a)
0.546875
Based on the calculation above, it looks like our average is 0.546875 (the exact float32 result can vary between NumPy versions), but if we set the dtype parameter to float64 we get a different result:
>>> np.mean(a, dtype=np.float64)
0.55000000074505806
The float64 result, 0.55000000074505806, is the accurate one.
Now, if you round both of these to two significant digits, you get 0.55 in both cases. Where this accuracy becomes important is when you carry out further operations on the number, especially when dealing with very large (or very small) numbers that need high accuracy.
For example:
((((0.55000000074505806*184.6651)^5)+0.666321)/46.778) = 231,044,656.404611
((((0.55*184.6651)^5)+0.666321)/46.778) = 231,044,654.839687
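The effect of that tiny input difference can be reproduced in plain Python. This is a sketch: follow_on is a hypothetical helper name wrapping the formula above, and Python writes the exponent as ** rather than ^.

```python
precise = 0.55000000074505806  # float64 mean from the example
rounded = 0.55                 # the same value rounded to two digits

def follow_on(x):
    """Apply the follow-on calculation from the example to x."""
    return ((x * 184.6651) ** 5 + 0.666321) / 46.778

diff = follow_on(precise) - follow_on(rounded)

print(follow_on(precise))
print(follow_on(rounded))
print(diff)  # about 1.56
```

A difference in the tenth decimal place of the input ends up as a difference in the ones place of the output, because raising to the fifth power magnifies it.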
Even in simpler equations you can end up being off by a few decimal places, and that can matter.
Important Word Distinction:
Lastly, purely on interpretation, you may find yourself analyzing data where you are asked to find the 'Average' of a dataset. You may want to use a different kind of average to get the most faithful representation of the dataset. For example, np.median() may be more representative than np.average() in cases with outliers, so it's important to know the statistical difference.
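As a small illustration of the outlier point, here is a sketch with made-up numbers (the salaries array is hypothetical, not from any real dataset). It shows how a single extreme value drags the mean while the median stays near the bulk of the data.

```python
import numpy as np

# Hypothetical data: five similar values and one extreme outlier
salaries = np.array([30, 32, 35, 38, 40, 1000])

mean_value = np.mean(salaries)      # pulled up by the outlier
median_value = np.median(salaries)  # middle of the sorted values

print(mean_value)    # roughly 195.8
print(median_value)  # 36.5
```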