
NumPy Difference Between np.average() and np.mean() [duplicate]

NumPy has two different functions for calculating an average:

np.average()

and

np.mean()

Since it is unlikely that NumPy would include a redundant feature, there must be a nuanced difference.

This was a concept I was very unclear on when I started doing data analysis in Python, so I decided to write a detailed self-answer here, as I am sure others are struggling with it too.

AdamSC asked Mar 12 '23 18:03



1 Answer

Short Answer:

'Mean' and 'average' are not strictly the same thing, although people use the words interchangeably. np.mean() always gives you the arithmetic mean, whereas np.average() gives you the arithmetic mean if you don't pass other parameters, but can also be used to take a weighted average.

Long Answer and Background:

Statistics:

Since NumPy is mostly used for working with data sets, it is important to understand the mathematical concept behind this confusion. In simple mathematics and everyday life we use the words 'average' and 'mean' interchangeably, but strictly speaking they are not the same.

  • Mean: Commonly refers to the 'Arithmetic Mean', or the sum of a collection of numbers divided by the number of numbers in the collection.
  • Average: Average can refer to many different calculations, of which the 'Arithmetic Mean' is one. Others include 'Median', 'Mode', 'Weighted Mean', 'Interquartile Mean' and many more.
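To make the distinction concrete, here is a small sketch that computes several different "averages" of the same numbers. The sample values and weights are made up for illustration; `statistics.mode` comes from the Python standard library, not from NumPy.

```python
import statistics

import numpy as np

values = [2, 3, 3, 5, 7]  # made-up sample data

arithmetic_mean = np.mean(values)    # (2 + 3 + 3 + 5 + 7) / 5 = 4.0
median = np.median(values)           # middle value once sorted = 3.0
mode = statistics.mode(values)       # most frequent value = 3

# Made-up weights: the last value counts three times as much as the others.
weighted_mean = np.average(values, weights=[1, 1, 1, 1, 3])  # 34 / 7

print(arithmetic_mean, median, mode, weighted_mean)
```

All four results are legitimately "an average" of the data, but only the first one is the arithmetic mean.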

What This Means For NumPy:

Back to the topic at hand. Since NumPy is normally used in applications related to mathematics, it needs to be a bit more precise about the difference between an average and a mean than tools like Excel, which uses AVERAGE() as its function for finding the 'Arithmetic Mean'.

np.mean()

In NumPy, np.mean() will allow you to calculate the 'Arithmetic Mean' of an array, either over all of its elements or along a specified axis.

Here's how you would use it:

import numpy as np

myArray = np.array([[3, 4], [5, 6]])
np.mean(myArray)  # 4.5, i.e. (3 + 4 + 5 + 6) / 4

There are also parameters for choosing which dtype is used in the computation and which axis the function should compute along (by default, the mean is taken over the flattened array).
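As a quick illustration of the axis parameter, the sketch below takes the mean of the same 2x2 array over everything, down the columns, and across the rows. The variable names are only for illustration.

```python
import numpy as np

my_array = np.array([[3, 4], [5, 6]])

overall = np.mean(my_array)              # flattened: (3 + 4 + 5 + 6) / 4 = 4.5
per_column = np.mean(my_array, axis=0)   # [(3 + 5) / 2, (4 + 6) / 2] = [4.0, 5.0]
per_row = np.mean(my_array, axis=1)      # [(3 + 4) / 2, (5 + 6) / 2] = [3.5, 5.5]

print(overall, per_column, per_row)
```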

np.average()

np.average(), on the other hand, allows you to take a 'Weighted Mean', in which different numbers in your array may carry different weights. For example, the documentation shows:

>>> data = list(range(1, 5))
>>> data
[1, 2, 3, 4]
>>> np.average(data)
2.5
>>> np.average(range(1,11), weights=range(10,0,-1))
4.0

For the last call, a non-weighted average would give 5.5 (the mean of the numbers 1 through 10). However, it ends up being 4.0 because we applied the weights to it.
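One way to see where the 4.0 comes from is to apply the usual weighted-mean formula by hand: multiply each value by its weight, add the products up, and divide by the sum of the weights. The sketch below does this for the same values and weights as the documentation example and compares it with np.average().

```python
import numpy as np

values = np.arange(1, 11)        # 1, 2, ..., 10
weights = np.arange(10, 0, -1)   # 10, 9, ..., 1

plain_mean = np.mean(values)                          # 55 / 10 = 5.5
by_hand = (values * weights).sum() / weights.sum()    # 220 / 55 = 4.0
with_numpy = np.average(values, weights=weights)      # 4.0

print(plain_mean, by_hand, with_numpy)
```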

If you don't have a good handle on what a 'weighted mean' is, we can try to simplify it.

Consider this a very elementary summary of the 'weighted mean'. It isn't going to be fully rigorous (which I hope someone will correct), but it should allow you to visualize what we're discussing.

A mean is the sum of all the numbers divided by how many numbers there are. This means they all have an equal weight, or are counted once. For our sample above (the numbers 1 through 10) this means:

(1+2+3+4+5+6+7+8+9+10)/10 = 5.5

A weighted mean counts the numbers at different weights. In our example above the weights are whole numbers (10 for the number 1, 9 for the number 2, and so on down to 1 for the number 10), so we can visualize the weighting by repeating each number as many times as its weight:

(1+1+1+1+1+1+1+1+1+1+2+2+2+2+2+2+2+2+2+3+3+3+3+3+3+3+3+4+4+4+4+4+4+4+5+5+5+5+5+5+6+6+6+6+6+7+7+7+7+8+8+8+9+9+10)/55 = 4

Even though the actual number set contains only one instance of the number 1, we're counting it at 10 times its normal weight. This also works the other way: we could count a number at 1/3 of its normal weight.
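The "repeat each number as many times as its weight" picture can be checked directly. This sketch uses np.repeat() to build the expanded list and confirms that its plain mean matches the weighted average. The fractional-weight example at the end uses made-up numbers.

```python
import numpy as np

values = np.arange(1, 11)        # 1, 2, ..., 10
weights = np.arange(10, 0, -1)   # 10, 9, ..., 1

# Repeat each value as many times as its weight: ten 1s, nine 2s, ..., one 10.
expanded = np.repeat(values, weights)

mean_of_expanded = np.mean(expanded)               # 220 / 55 = 4.0
weighted = np.average(values, weights=weights)     # 4.0

# Weights do not have to be whole numbers: here 3 counts at 1/3 of the weight of 6.
fractional = np.average([3, 6], weights=[1 / 3, 1])   # about 5.25

print(len(expanded), mean_of_expanded, weighted, fractional)
```

The repetition trick only works for whole-number weights, which is why np.average() uses the weights directly instead.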

If you don't provide a weights parameter to np.average(), it will simply give you the equally weighted average over the flattened array, which is equivalent to np.mean().
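A short sketch of that equivalence, using a small made-up array: without weights the two functions agree, both on the flattened array and along an axis, and passing equal weights changes nothing.

```python
import numpy as np

data = np.array([[3, 4], [5, 6]])

# With no weights, np.average() and np.mean() agree on the flattened array...
same_overall = np.average(data) == np.mean(data)

# ...and along an axis.
same_per_column = np.array_equal(np.average(data, axis=0), np.mean(data, axis=0))

# Equal weights give the same result as no weights at all.
equal_weights = np.average(data, axis=0, weights=[1, 1])   # [4.0, 5.0]

print(same_overall, same_per_column, equal_weights)
```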

Why Would I Ever Use np.mean()?

If np.average() can be used to find the flat arithmetic mean, then you may be asking yourself "why would I ever use np.mean()?" np.mean() allows for a few useful parameters that np.average() does not. One of the key ones is the dtype parameter, which allows you to set the type used in the computation.

For example the NumPy docs give us this case:

Single precision:
>>> a = np.zeros((2, 512*512), dtype=np.float32)
>>> a[0, :] = 1.0
>>> a[1, :] = 0.1
>>> np.mean(a)
0.546875 

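Based on the calculation above it looks like our average is 0.546875 (the exact float32 result may differ between NumPy versions, since newer releases sum the values differently), but if we set the dtype parameter to float64 we get a different result: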

>>> np.mean(a, dtype=np.float64)
0.55000000074505806

The actual average is 0.55000000074505806.

Now, if you round both of these to two significant digits you get 0.55 in both cases. Where this accuracy becomes important is when you still have multiple sets of operations to perform on the number, especially when dealing with very large (or very small) numbers that need high accuracy.

For example, compare the full-precision value with the rounded value 0.55 in a longer calculation:

((((0.55000000074505806*184.6651)^5)+0.666321)/46.778) = 231,044,656.404611

((((0.55*184.6651)^5)+0.666321)/46.778) = 231,044,654.839687
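The chained calculation above can be reproduced in a few lines of Python. This is only a sketch: the helper name `downstream` is made up, and it compares the full-precision mean with the value rounded to 0.55 to show how a tiny input difference grows.

```python
def downstream(x):
    """Hypothetical follow-up calculation, using the constants from the text."""
    return (((x * 184.6651) ** 5) + 0.666321) / 46.778


full_precision = downstream(0.55000000074505806)
rounded_input = downstream(0.55)

# The inputs differ by less than one part in a billion,
# yet the results differ by roughly 1.56.
difference = full_precision - rounded_input

print(f"{full_precision:,.6f}")
print(f"{rounded_input:,.6f}")
print(f"{difference:.6f}")
```

Raising to the fifth power multiplies the relative error by about five, and the large magnitude of the result turns that into a visible absolute difference.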

Even in simpler equations you can end up being off by a few decimal places and that can be relevant in:

  • Scientific simulations: Due to lengthy equations, multiple steps and a high degree of accuracy needed.
  • Statistics: A difference of a few percentage points of accuracy can be crucial (for example in medical studies).
  • Finance: Continually being off by even a few cents in large financial models or when tracking large amounts of capital (banking/private equity) could result in hundreds of thousands of dollars in errors by the end of the year.

Important Word Distinction

Lastly, purely on interpretation: you may find yourself analyzing data where you are asked to find the 'average' of a dataset. You may want to use a different kind of average to get the most accurate representation of the dataset. For example, np.median() may be more representative than np.average() in cases with outliers, so it's important to know the statistical difference.
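As a rough illustration of the outlier point, the sketch below uses made-up salary figures (in thousands) where a single extreme value drags the arithmetic mean far away from what most of the data looks like, while the median barely notices it.

```python
import numpy as np

# Made-up salaries in thousands; the last entry is an outlier.
salaries = np.array([30, 32, 35, 38, 40, 1000])

mean_salary = np.mean(salaries)       # 1175 / 6, about 195.83
median_salary = np.median(salaries)   # (35 + 38) / 2 = 36.5

print(mean_salary, median_salary)
```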

AdamSC answered Apr 30 '23 20:04
