R: How to get a sum of two distributions?

I have a simple question. I would like to compute the sum of two non-parametric distributions.

Here is an example. There are two cities, each with 10 houses, and we know the energy consumption of each house. I want to get the probability distribution of the sum of the consumptions of one house chosen at random from each city.

A1 <- c(1,2,3,3,3,4,4,5,6,7) #10 houses' energy consumption for city A
B1 <- c(11,13,15,17,17,18,18,19,20,22) #10 houses' energy consumption for city B

I have the probability distributions of A1 and B1; how can I get the probability distribution of A1+B1? If I just use A1+B1 in R, it gives 12 15 18 20 20 22 22 24 26 29. However, I don't think this is right, because there is no inherent order to the houses.

When I change the order of the houses, it gives different results.

# Original
A1 <- c(1,2,3,3,3,4,4,5,6,7)
B1 <- c(11,13,15,17,17,18,18,19,20,22)
#change order 1
A2 <- c(7,6,5,4,4,3,3,3,2,1) 
B2 <- c(22,20,19,18,18,17,17,15,13,11)
#change order 2
A3 <- c(3,3,3,4,4,5,6,7,1,2) 
B3 <- c(17,17,18,18,19,13,20,11,22,15)
sum1 <- A1+B1; sum1
sum2 <- A1+B2; sum2
sum3 <- A3+B3; sum3

[plot: sum1, sum2, and sum3 shown as red lines]

The red lines are sum1, sum2, and sum3. I am not sure how to get the distribution of the sum of two distributions. Please give me any ideas. Thanks!

(If the distributions were normal or uniform, I could get the distribution of the sum easily, but these are not normal and the houses have no order.)
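(For example, if they were normal, I could do something like this; the plotting range and variable names here are just for illustration:)

# using A1 and B1 defined above
# if A and B were independent normal variables, the sum would again be normal,
# with the means added and (assuming independence) the variances added
mu_sum <- mean(A1) + mean(B1)       # 3.8 + 17 = 20.8
sd_sum <- sqrt(var(A1) + var(B1))
curve(dnorm(x, mean = mu_sum, sd = sd_sum), from = 10, to = 32,
      xlab = "A + B", ylab = "density")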

Asked Dec 08 '15 by Open your eyes




3 Answers

In theory, the distribution of the sum of two independent random variables is the convolution of their PDFs. For Z = X + Y:

PDF(Z) = PDF(X) * PDF(Y), where * denotes convolution.

For discrete values this reads P(Z = z) = Σ_x P(X = x) · P(Y = z − x). So I think this case can be computed with convolve().

# your data
A1 <- c(1,2,3,3,3,4,4,5,6,7) #10 houses' energy consumption for city A
B1 <- c(11,13,15,17,17,18,18,19,20,22) #10 houses' energy consumption for city B

# compute the empirical PDF/CDF (the CDFs are not needed for the convolution)
PDF_A1 <- table(A1)/length(A1)
CDF_A1 <- cumsum(PDF_A1)

PDF_B1 <- table(B1)/length(B1)
CDF_B1 <- cumsum(PDF_B1)

# compute the sum distribution; per ?convolve, the mathematical convolution
# of two sequences x and y is convolve(x, rev(y), type = "open")
PDF_C1 <- convolve(PDF_A1, rev(PDF_B1), type = "open")

# plotting
plot(PDF_C1, type="l", axes=FALSE, main="PDF of A1+B1")
box()
axis(2)
# FIXME: is my understanding of the x-axis labels correct?
axis(1, at=1:14, labels=c(names(PDF_A1)[-1], names(PDF_B1)))

[plot: PDF of A1+B1]

Note:

CDF: cumulative distribution function

PDF: probability density function

## To make the x-values correspond to the actual sums, consider
## compute PDF
## pad zeros in probability vectors to convolve
r <- range(c(A1, B1))
pdfA <- pdfB <- vector('numeric', diff(r)+1L)
PDF_A1 <- table(A1)/length(A1)                        # same as what you have done
PDF_B1 <- table(B1)/length(B1)
pdfA[as.numeric(names(PDF_A1))] <- as.vector(PDF_A1)  # fill the values
pdfB[as.numeric(names(PDF_B1))] <- as.vector(PDF_B1)

## compute the convolution and plot
## (res[k] is the probability that the sum equals k + 1, so shift the x-axis)
res <- convolve(pdfA, rev(pdfB), type = "open")
plot(seq_along(res) + 1, res, type="h", xlab='Sum', ylab='Probability')

[plot: PDF of A1+B1 with the x-axis giving the actual sums]

## In this simple case (a discrete distribution) you can compare
## with the exhaustive enumeration of all house pairs (as in jeremycg's answer)
tst <- rowSums(expand.grid(A1, B1))
plot(table(tst) / sum(as.vector(table(tst))), type='h')

[plot: distribution of all pairwise sums]

Answered by Patric


Edit:

Now that I better understand the question, and see @jeremycg's answer, I have a different approach that I think will scale better with sample size.

Rather than relying on the values in A1 and B1 being the only values in the distribution, we could treat them as samples from an underlying distribution. To avoid imposing a particular form on that distribution, I'll use an empirical equivalent: the sample density. Using the density function, we can infer the relative probabilities of sampling a continuous range of household energy uses in either town. We can then randomly draw an arbitrary number of energies (with replacement) from the density()$x values, weighting the draws with prob=density()$y; i.e., peaks in the density plot are at x-values that should be resampled more often.

As a heuristic, an oversimplified statement would be that mean(A1) is 3.8 and mean(B1) is 17, so the sum of energy uses from the two cities should be, on average, ~20.8. Using this as a "does it make sense" check, I think the following approach gives the type of result you want.

sample_sum <- function(A, B, n, ...){
    # draw n values from the sample density of X, weighting each density()$x
    # point by its density height and sampling with replacement
    qss <- function(X, n, ...){
        dens_X <- density(X, ...)
        sample(dens_X$x, size=n, prob=dens_X$y, replace=TRUE)
    }

    sample_A <- qss(A, n=n, ...)
    sample_B <- qss(B, n=n, ...)

    # element-wise sums of the independent draws from the two cities
    sample_A + sample_B
}

ss <- sample_sum(A1, B1, n=100, from=0)

png("~/Desktop/answer.png", width=5, height=5, units="in", res=150)
plot(density(ss))
dev.off()

Note that I bounded the density estimate at 0 (via from=0), because I'm assuming you don't want to infer negative energies. I see that the peak of the resulting density is just above 20, so "it makes sense".
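A quick numeric check of that ~20.8 target (borrowing the expand.grid idea from jeremycg's answer purely as a sanity check):

mean(A1) + mean(B1)                  # 3.8 + 17 = 20.8
mean(rowSums(expand.grid(A1, B1)))   # 20.8 as well, averaging all 100 pairwise sums
mean(ss)                             # should land near 20.8, up to sampling noise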

The potential advantage here is that you don't need to look at every possible combination of energies from the houses in the two cities to understand the distribution of summed energy uses. If you can define the distribution of both, you can define the distribution of paired sums.

Finally, the computation time is trivial, especially compared to the approach of enumerating all combinations. E.g., with 10 million houses in each city, the expand.grid approach fails with Error: cannot allocate vector of size 372529.0 Gb, whereas the sample_sum approach takes 0.12 seconds.
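If you want to reproduce that comparison, something along these lines could be used; the simulated city data and the lognormal parameters below are my own stand-ins, and the timings will depend on your machine:

# sketch of the scaling comparison; bigA/bigB are simulated stand-ins for
# the 10-million-house cities (lognormal draws centred roughly on the sample means)
set.seed(1)
bigA <- rlnorm(1e7, meanlog = log(4),  sdlog = 0.4)
bigB <- rlnorm(1e7, meanlog = log(17), sdlog = 0.2)
system.time(ss_big <- sample_sum(bigA, bigB, n = 100, from = 0))
# expand.grid(bigA, bigB) would need 1e7 * 1e7 rows, hence the
# "cannot allocate vector" error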

Of course, if the answer doesn't help you, the speed is worthless ;)

[plot: density of the sampled sums, peaking just above 20]

Answered by rbatt


You probably want something like:

rowSums(expand.grid(A1, B1))

Using expand.grid will get you a data frame of all combinations of A1 and B1, and rowSums will add each pair.
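If you want the result as a probability distribution rather than a raw vector of sums, one way is to tabulate them:

# tabulate the pairwise sums into a probability distribution and plot it
sums <- rowSums(expand.grid(A1, B1))
prop.table(table(sums))              # P(A + B = s) for each observed sum s
plot(prop.table(table(sums)), type = "h", xlab = "Sum", ylab = "Probability")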

Answered by jeremycg