I have two sets of statistics generated from processing. The processing can produce a large volume of results, so I would rather not have to store all of the raw data just to recalculate summary statistics later on.
Say I have two sets of statistics that describe two different sessions of runs over a process.
Each set contains
Statistics : { mean, median, standard deviation, runs on process}
How would I merge the two sets' median and standard deviation to get a combined summary describing both sessions?
Remember, I can't preserve both sets of data that the statistics are describing.
Artelius is mathematically right, but the way he suggests computing the variance is numerically unstable. You want to compute the variance as follows:
new_var=(n(0)*(var(0)+(mean(0)-new_mean)**2) + n(1)*(var(1)+(mean(1)-new_mean)**2) + ...)/new_n
Edit (from a comment):
The problem with the original code is that, if your standard deviation is small compared to your mean, you end up subtracting a large number from a large number to get a relatively small number, which costs you floating-point precision. The new code avoids this problem: rather than converting to E(X^2) and back, it just adds all the contributions to the total variance together, properly weighted according to their sample size.
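To make the difference concrete, here is a minimal Python sketch. The function names and the two sample summaries are made up for illustration, and var is assumed to be the population variance (dividing by n, not n-1).

```python
# Each summary is (n, mean, var); var is assumed to be the population variance.

def merged_mean(groups):
    new_n = sum(n for n, _, _ in groups)
    return new_n, sum(n * mean for n, mean, _ in groups) / new_n

def merge_naive(groups):
    # Goes through E(X^2): ends by subtracting one huge number from another.
    new_n, new_mean = merged_mean(groups)
    return sum(n * (var + mean ** 2) for n, mean, var in groups) / new_n - new_mean ** 2

def merge_stable(groups):
    # Adds each group's own variance plus its squared offset from the new mean.
    new_n, new_mean = merged_mean(groups)
    return sum(n * (var + (mean - new_mean) ** 2) for n, mean, var in groups) / new_n

# Hypothetical sessions with large means and a small spread.
# The exact combined variance for these summaries is 0.5.
groups = [(2, 1e9, 0.25), (2, 1e9 + 1.0, 0.25)]
naive_var = merge_naive(groups)
stable_var = merge_stable(groups)
print("naive: ", naive_var)
print("stable:", stable_var)
```

With means around 1e9, the squared terms are around 1e18, where adjacent double-precision values are more than 100 apart, so a variance of 0.5 cannot survive the naive subtraction.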
Median is not possible. Say you have two tuples, (1, 1, 1, 2) and (0, 0, 2, 3, 3). Their medians are 1 and 2, and the overall median is 1. There is no way to tell that from the two medians alone.
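A quick check in Python, as a sketch: the first pair is the example above, and the second pair is a made-up one with the same sizes and the same two medians but a different overall median.

```python
import statistics

# Pair from the example above: medians 1 and 2.
a1, b1 = [1, 1, 1, 2], [0, 0, 2, 3, 3]
# Made-up second pair: same sizes, same medians 1 and 2.
a2, b2 = [0, 1, 1, 5], [2, 2, 2, 3, 3]

for a, b in [(a1, b1), (a2, b2)]:
    print(statistics.median(a), statistics.median(b), statistics.median(a + b))
```

Both pairs present the same sizes and medians, yet the combined medians differ, so those inputs alone cannot determine the answer.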
You can get the mean and standard deviation, but not the median.
new_n = (n(0) + n(1) + ...)
new_mean = (mean(0)*n(0) + mean(1)*n(1) + ...) / new_n
new_var = ((var(0)+mean(0)**2)*n(0) + (var(1)+mean(1)**2)*n(1) + ...) / new_n - new_mean**2
where n(0) is the number of runs in the first data set, n(1) is the number of runs in the second, and so on, mean is the mean, and var is the variance (which is just standard deviation squared). n**2 means "n squared".
Getting the combined variance relies on the fact that the variance of a data set is equal to the mean of the square of the data set minus the square of the mean of the data set. In statistical language,
Var(X) = E(X^2) - E(X)^2
The var(i)+mean(i)**2 terms above give us the E(X^2) portion for each data set i, which we can then combine with the other data sets to get the desired result.
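As a sanity check, the formulas can be sketched in Python and compared against a direct calculation on the pooled data. The helper names and sample values here are made up, and var is assumed to be the population variance (dividing by n); a sample variance (dividing by n-1) would need converting first.

```python
import math
import statistics

def summarize(data):
    # (n, mean, population variance) for one session's raw data
    return len(data), statistics.fmean(data), statistics.pvariance(data)

def combine(summaries):
    new_n = sum(n for n, _, _ in summaries)
    new_mean = sum(mean * n for n, mean, _ in summaries) / new_n
    # (var + mean**2) is E(X^2) for each data set
    mean_of_squares = sum((var + mean ** 2) * n for n, mean, var in summaries) / new_n
    new_var = mean_of_squares - new_mean ** 2
    return new_n, new_mean, new_var

first = [2.0, 4.0, 4.0, 5.0]
second = [1.0, 3.0, 8.0]
n, mean, var = combine([summarize(first), summarize(second)])
direct = summarize(first + second)
print((n, mean, math.sqrt(var)))
print((direct[0], direct[1], math.sqrt(direct[2])))
```

Only the three numbers per session are needed at merge time; the raw lists appear here solely to verify the result.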
In terms of medians:
If you are combining exactly two data sets, then you can be certain that the combined median lies somewhere between the two medians (or is equal to one of them), but there is little more that you can say. Taking their average should be OK as an estimate, unless you need the result to be equal to some actual data point.
If you are combining many data sets in one go, you can either take the median of the medians, or take their average. If there may be significant systematic differences between the different data sets, then taking their average is probably better, as taking the median reduces the effect of outliers. But if you have systematic differences between runs, disregarding them is probably not a good thing to do.
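A small Python sketch of both estimates, using made-up runs. The raw data is kept here only so the estimates can be compared against the true combined median; neither estimate is guaranteed to match it.

```python
import statistics

# Hypothetical raw runs, kept only to compare the estimates with the truth.
runs = [
    [1, 2, 3, 4, 100],
    [5, 6, 7],
    [8, 9, 10, 11],
]
medians = [statistics.median(r) for r in runs]

median_of_medians = statistics.median(medians)
mean_of_medians = statistics.fmean(medians)
true_median = statistics.median([x for r in runs for x in r])

print("medians:          ", medians)
print("median of medians:", median_of_medians)
print("mean of medians:  ", mean_of_medians)
print("true median:      ", true_median)
```

The true median falls between the smallest and largest per-run medians, but which estimate lands closer depends on the data.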