I have a simple question: I would like to sum two non-parametric distributions.
Here is an example. There are two cities, each with 10 houses, and we know the energy consumption of each house. I want to get the probability distribution of the sum of a random house chosen from each city.
A1 <- c(1,2,3,3,3,4,4,5,6,7) #10 houses' energy consumption for city A
B1 <- c(11,13,15,17,17,18,18,19,20,22) #10 houses' energy consumption for city B
I have a probability distribution for A1 and for B1; how can I get the probability distribution of A1+B1?
If I just use A1+B1 in R, it gives 12 15 18 20 20 22 22 24 26 29. However, I don't think this is right, because the houses have no inherent order: when I change the order of the houses, I get different results.
# Original
A1 <- c(1,2,3,3,3,4,4,5,6,7)
B1 <- c(11,13,15,17,17,18,18,19,20,22)
#change order 1
A2 <- c(7,6,5,4,4,3,3,3,2,1)
B2 <- c(22,20,19,18,18,17,17,15,13,11)
#change order 2
A3 <- c(3,3,3,4,4,5,6,7,1,2)
B3 <- c(17,17,18,18,19,13,20,11,22,15)
sum1 <- A1+B1; sum1
sum2 <- A1+B2; sum2
sum3 <- A3+B3; sum3
Red lines in my plot are sum1, sum2, and sum3. I am not sure how I can get the distribution of the sum of two distributions. Please give me any ideas. Thanks!
(If those distributions were normal or uniform distributions, I could get the distribution of the sum easily, but these are not normal and there is no order.)
One common method of consolidating two probability distributions is simply to average them: for every set of values A, set P(A) = (P1(A) + P2(A))/2. If the distributions both have densities, for example, averaging the probabilities results in a probability distribution whose density is the average of the two input densities (Figure 1).
The formula is simple: for any value x, add the values of the PMFs at x, weighted appropriately. If the weights sum to 1, then the weighted sum of the PMF values also sums to 1, so the weighted sum of your PMFs is itself a probability distribution.
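As a quick sketch in R using the example data (the equal 1/2 weights here are an assumption for illustration):

```r
# Sketch: mixing the two empirical PMFs with equal weights of 1/2 each.
A1 <- c(1,2,3,3,3,4,4,5,6,7)
B1 <- c(11,13,15,17,17,18,18,19,20,22)
pmf_A <- tabulate(A1, nbins = 22) / length(A1)  # P(A = v) for v in 1:22
pmf_B <- tabulate(B1, nbins = 22) / length(B1)  # P(B = v) for v in 1:22
mix <- 0.5 * pmf_A + 0.5 * pmf_B                # weighted sum of the PMFs
sum(mix)                                        # still sums to 1
```

Note this averaging gives the mixture distribution (pick a city at random, then a house), which is different from the distribution of the sum asked about in the question.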
For any two random variables X and Y, the variance of their sum equals the sum of the variances plus twice the covariance: Var(X+Y) = Var(X) + Var(Y) + 2 Cov(X, Y).
For independent random variables the covariance term is zero, so the variances simply add. This means that the sum of two independent normally distributed random variables is normal, with its mean being the sum of the two means, and its variance being the sum of the two variances (i.e., the square of the standard deviation is the sum of the squares of the standard deviations).
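You can verify the independent case directly with the example data, since choosing one house from each city independently makes all 100 pairs equally likely (a sketch, not part of the original answer):

```r
# Sketch: Var(X+Y) = Var(X) + Var(Y) for independent draws from the two cities.
A1 <- c(1,2,3,3,3,4,4,5,6,7)
B1 <- c(11,13,15,17,17,18,18,19,20,22)
pairs <- expand.grid(A = A1, B = B1)       # all 100 equally likely combinations
s <- pairs$A + pairs$B
pvar <- function(x) mean((x - mean(x))^2)  # population variance (divide by n)
pvar(s)              # variance of the sum
pvar(A1) + pvar(B1)  # equal, because the covariance term is zero here
```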
In theory, the distribution of the sum of two independent random variables is the convolution of their PDFs:
PDF(Z) = PDF(X) * PDF(Y)
where * denotes convolution. So I think this case can be computed by convolution.
# your data
A1 <- c(1,2,3,3,3,4,4,5,6,7) #10 houses' energy consumption for city A
B1 <- c(11,13,15,17,17,18,18,19,20,22) #10 houses' energy consumption for city B
# compute PDF/CDF
PDF_A1 <- table(A1)/length(A1)
CDF_A1 <- cumsum(PDF_A1)
PDF_B1 <- table(B1)/length(B1)
CDF_B1 <- cumsum(PDF_B1)
# compute the sum distribution (rev() is needed so convolve()
# performs a true convolution rather than a correlation)
PDF_C1 <- convolve(PDF_B1, rev(PDF_A1), type = "open")
# plotting
plot(PDF_C1, type="l", axes=FALSE, main="PDF of A1+B1")
box()
axis(2)
# FIXME: is my understanding of the x-axis correct?
axis(1, at=1:14, labels=c(names(PDF_A1)[-1], names(PDF_B1)))
Note:
CDF: cumulative distribution function
PDF: probability density function (here, with discrete data, these are really probability mass functions)
## To make the x-values correspond to actual sums, consider:
## compute PDF
## pad the probability vectors with zeros before convolving
r <- range(c(A1, B1))
pdfA <- pdfB <- vector('numeric', diff(r)+1L)
PDF_A1 <- table(A1)/length(A1) # same as what you have done
PDF_B1 <- table(B1)/length(B1)
pdfA[as.numeric(names(PDF_A1))] <- as.vector(PDF_A1) # fill the values (index == value works here only because min(r) is 1)
pdfB[as.numeric(names(PDF_B1))] <- as.vector(PDF_B1)
## compute the convolution and plot
res <- convolve(pdfA, rev(pdfB), type = "open")
plot(res, type="h", xlab='Sum', ylab='')
## In this simple case (with discrete distribution) you can compare
## to previous solution
tst <- rowSums(expand.grid(A1, B1))
plot(table(tst) / length(tst), type='h')
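If you also want to read the sum values off the convolution result directly, one way is the following sketch (re-using the zero-padding idea; note the index == value shortcut again assumes the minimum value is 1):

```r
# Sketch: mapping convolution indices back to actual sum values.
A1 <- c(1,2,3,3,3,4,4,5,6,7)
B1 <- c(11,13,15,17,17,18,18,19,20,22)
r <- range(c(A1, B1))
pdfA <- pdfB <- numeric(diff(r) + 1L)
tA <- table(A1) / length(A1)
tB <- table(B1) / length(B1)
pdfA[as.numeric(names(tA))] <- as.vector(tA)  # index == value since min is 1
pdfB[as.numeric(names(tB))] <- as.vector(tB)
res <- convolve(pdfA, rev(pdfB), type = "open")
# res[k] collects products pdfA[i] * pdfB[j] with i + j = k + 1,
# so res[k] is the probability that the sum equals k + 1
sums <- seq_along(res) + 1
plot(sums, res, type = "h", xlab = "Sum", ylab = "Probability")
```

This makes the x-axis show the true sums (12 through 29 carry all the probability mass), which answers the FIXME about the axis labels in the first plot.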
Edit:
Now that I better understand the question, and see @jeremycg's answer, I think I have a different approach that will scale better with sample size.
Rather than relying on the values in A1 and B1 being the only values in the distribution, we could treat them as samples from an underlying distribution. To avoid imposing a particular form on that distribution, I'll use an empirical 'equivalent': the sample density. Using the density function, we can infer the relative probabilities of sampling a continuous range of household energy uses in either town. We can then randomly draw an arbitrary number of energies (with replacement) from the density()$x values, with the samples weighted by prob=density()$y ... i.e., peaks in the density plot are at x-values that should be resampled more often.
As a heuristic, an oversimplified statement would be: mean(A1) is 3.8 and mean(B1) is 17, so the sum of energy uses from the two cities should be, on average, ~20.8. Using this as the "does it make sense" check, I think the following approach aligns with the type of result you want.
sample_sum <- function(A, B, n, ...){
  # draw n values from the sample density of X, weighted by the density heights
  qss <- function(X, n, ...){
    dens_X <- density(X, ...)
    sample(dens_X$x, size=n, prob=dens_X$y, replace=TRUE)
  }
  sample_A <- qss(A, n=n, ...)
  sample_B <- qss(B, n=n, ...)
  sample_A + sample_B
}
ss <- sample_sum(A1, B1, n=100, from=0)
png("~/Desktop/answer.png", width=5, height=5, units="in", res=150)
plot(density(ss))
dev.off()
Note that I bounded the density plot at 0, because I'm assuming you don't want to infer negative energies. I see that the peak in the resultant density is just above 20, so 'it makes sense'.
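The "does it make sense" heuristic can also be checked exactly from the data, since the expected value of the sum of two independent draws is the sum of the two means:

```r
# Sketch: the sanity check computed exactly.
A1 <- c(1,2,3,3,3,4,4,5,6,7)
B1 <- c(11,13,15,17,17,18,18,19,20,22)
mean(A1) + mean(B1)  # 3.8 + 17 = 20.8, where the summed density should peak
```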
The potential advantage here is that you don't need to look at every possible combination of energies from the houses in the two cities to understand the distribution of summed energy uses. If you can define the distribution of both, you can define the distribution of paired sums.
Finally, the computation time is trivial, especially compared to the approach of finding all combinations. E.g., with 10 million houses in each city, the expand.grid approach gives an Error: cannot allocate vector of size 372529.0 Gb, whereas the sample_sum approach takes 0.12 seconds.
Of course, if the answer doesn't help you, the speed is worthless ;)
You probably want something like:
rowSums(expand.grid(A1, B1))
Using expand.grid will get you a data frame of all combinations of A1 and B1, and rowSums will add them.
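A quick sketch of that with the example data, normalizing to get the exact PMF of the sum over all 100 equally likely pairs:

```r
# Exact PMF of the sum of one random house from each city.
A1 <- c(1,2,3,3,3,4,4,5,6,7)
B1 <- c(11,13,15,17,17,18,18,19,20,22)
all_sums <- rowSums(expand.grid(A1, B1))    # all 100 combinations
pmf <- table(all_sums) / length(all_sums)   # probabilities summing to 1
pmf
```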