I have a simple question: I would like to sum two non-parametric distributions.
Here is an example. There are two cities, each with 10 houses, and we know the energy consumption of each house. I want to get the probability distribution of the sum of a random house chosen from each city.
A1 <- c(1,2,3,3,3,4,4,5,6,7) #10 houses' energy consumption for city A
B1 <- c(11,13,15,17,17,18,18,19,20,22) #10 houses' energy consumption for city B
I have a probability distribution for A1 and for B1; how can I get the probability distribution of A1+B1?
If I just use A1+B1 in R, it gives 12 15 18 20 20 22 22 24 26 29. However, I don't think this is right, because the houses have no inherent order: when I change the order of the houses, I get different results.
# Original
A1 <- c(1,2,3,3,3,4,4,5,6,7)
B1 <- c(11,13,15,17,17,18,18,19,20,22)
#change order 1
A2 <- c(7,6,5,4,4,3,3,3,2,1)
B2 <- c(22,20,19,18,18,17,17,15,13,11)
#change order 2
A3 <- c(3,3,3,4,4,5,6,7,1,2)
B3 <- c(17,17,18,18,19,13,20,11,22,15)
sum1 <- A1+B1; sum1
sum2 <- A1+B2; sum2
sum3 <- A3+B3; sum3
Red lines in my plot are sum1, sum2, and sum3. I am not sure how I can get the distribution of the sum of two distributions. Please give me any ideas. Thanks!
(If those distributions were normal or uniform distributions, I could get the distribution of the sum easily, but these are not normal and there is no order.)
One common method of consolidating two probability distributions is simply to average them: for every set of values A, set P(A) = (P1(A) + P2(A))/2. If the distributions both have densities, for example, averaging the probabilities results in a probability distribution whose density is the average of the two input densities (Figure 1).
The formula is simple: for any value x, add the values of the PMFs at x, weighted appropriately. If the weights sum to 1, then the weighted sum of the PMF values also sums to 1, so the weighted sum of your PMFs is itself a probability distribution.
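As a quick sketch in R using the example data (the equal 1/2 weights here are an assumption for illustration):

```r
# Sketch: mixing the two empirical PMFs with equal weights of 1/2 each.
A1 <- c(1,2,3,3,3,4,4,5,6,7)
B1 <- c(11,13,15,17,17,18,18,19,20,22)
pmf_A <- tabulate(A1, nbins = 22) / length(A1)  # P(A = v) for v in 1:22
pmf_B <- tabulate(B1, nbins = 22) / length(B1)  # P(B = v) for v in 1:22
mix <- 0.5 * pmf_A + 0.5 * pmf_B                # weighted sum of the PMFs
sum(mix)                                        # still sums to 1
```

Note this averaging gives the mixture distribution (pick a city at random, then a house), which is different from the distribution of the sum asked about in the question.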
For any two random variables X and Y, the variance of their sum equals the sum of the variances plus twice the covariance: Var(X+Y) = Var(X) + Var(Y) + 2 Cov(X, Y).
For independent random variables the covariance term is zero, so the variances simply add. This means that the sum of two independent normally distributed random variables is normal, with its mean being the sum of the two means, and its variance being the sum of the two variances (i.e., the square of the standard deviation is the sum of the squares of the standard deviations).
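You can verify the independent case directly with the example data, since choosing one house from each city independently makes all 100 pairs equally likely (a sketch, not part of the original answer):

```r
# Sketch: Var(X+Y) = Var(X) + Var(Y) for independent draws from the two cities.
A1 <- c(1,2,3,3,3,4,4,5,6,7)
B1 <- c(11,13,15,17,17,18,18,19,20,22)
pairs <- expand.grid(A = A1, B = B1)       # all 100 equally likely combinations
s <- pairs$A + pairs$B
pvar <- function(x) mean((x - mean(x))^2)  # population variance (divide by n)
pvar(s)              # variance of the sum
pvar(A1) + pvar(B1)  # equal, because the covariance term is zero here
```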
In theory, the distribution of the sum of two independent random variables is the convolution of their PDFs:
PDF(Z) = PDF(X) * PDF(Y)
where * denotes convolution. So I think this case can be computed by convolution.
# your data
A1 <- c(1,2,3,3,3,4,4,5,6,7) #10 houses' energy consumption for city A
B1 <- c(11,13,15,17,17,18,18,19,20,22) #10 houses' energy consumption for city B
# compute PDF/CDF
PDF_A1 <- table(A1)/length(A1)
CDF_A1 <- cumsum(PDF_A1)
PDF_B1 <- table(B1)/length(B1)
CDF_B1 <- cumsum(PDF_B1)
# compute the sum distribution (rev() is needed so convolve()
# performs a true convolution rather than a correlation)
PDF_C1 <- convolve(PDF_B1, rev(PDF_A1), type = "open")
# plotting
plot(PDF_C1, type="l", axes=FALSE, main="PDF of A1+B1")
box()
axis(2)
# FIXME: is my understanding of the x-axis correct?
axis(1, at=1:14, labels=c(names(PDF_A1)[-1], names(PDF_B1)))
Note:
CDF: cumulative distribution function
PDF: probability density function (here, with discrete data, these are really probability mass functions)
## To make the x-values correspond to actual sums, consider:
## compute PDF
## pad the probability vectors with zeros before convolving
r <- range(c(A1, B1))
pdfA <- pdfB <- vector('numeric', diff(r)+1L)
PDF_A1 <- table(A1)/length(A1) # same as what you have done
PDF_B1 <- table(B1)/length(B1)
pdfA[as.numeric(names(PDF_A1))] <- as.vector(PDF_A1) # fill the values (index == value works here only because min(r) is 1)
pdfB[as.numeric(names(PDF_B1))] <- as.vector(PDF_B1)
## compute the convolution and plot
res <- convolve(pdfA, rev(pdfB), type = "open")
plot(res, type="h", xlab='Sum', ylab='')
## In this simple case (with discrete distribution) you can compare
## to previous solution
tst <- rowSums(expand.grid(A1, B1))
plot(table(tst) / length(tst), type='h')
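If you also want to read the sum values off the convolution result directly, one way is the following sketch (re-using the zero-padding idea; note the index == value shortcut again assumes the minimum value is 1):

```r
# Sketch: mapping convolution indices back to actual sum values.
A1 <- c(1,2,3,3,3,4,4,5,6,7)
B1 <- c(11,13,15,17,17,18,18,19,20,22)
r <- range(c(A1, B1))
pdfA <- pdfB <- numeric(diff(r) + 1L)
tA <- table(A1) / length(A1)
tB <- table(B1) / length(B1)
pdfA[as.numeric(names(tA))] <- as.vector(tA)  # index == value since min is 1
pdfB[as.numeric(names(tB))] <- as.vector(tB)
res <- convolve(pdfA, rev(pdfB), type = "open")
# res[k] collects products pdfA[i] * pdfB[j] with i + j = k + 1,
# so res[k] is the probability that the sum equals k + 1
sums <- seq_along(res) + 1
plot(sums, res, type = "h", xlab = "Sum", ylab = "Probability")
```

This makes the x-axis show the true sums (12 through 29 carry all the probability mass), which answers the FIXME about the axis labels in the first plot.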
Edit:
Now that I better understand the question, and see @jeremycg's answer, I think I have a different approach that will scale better with sample size.
Rather than relying on the values in A1 and B1 being the only values in the distribution, we could treat them as samples from an underlying distribution. To avoid imposing a particular form on that distribution, I'll use an empirical 'equivalent': the sample density. Using the density function, we can infer the relative probabilities of sampling a continuous range of household energy uses in either town. We can then randomly draw an arbitrary number of energies (with replacement) from the density()$x values, with the samples weighted by prob=density()$y ... i.e., peaks in the density plot are at x-values that should be resampled more often.
As a heuristic, an oversimplified statement would be: mean(A1) is 3.8 and mean(B1) is 17, so the sum of energy uses from the two cities should be, on average, ~20.8. Using this as the "does it make sense" check, I think the following approach aligns with the type of result you want.
sample_sum <- function(A, B, n, ...){
  # draw n values from the sample density of X, weighted by the density heights
  qss <- function(X, n, ...){
    dens_X <- density(X, ...)
    sample(dens_X$x, size=n, prob=dens_X$y, replace=TRUE)
  }
  sample_A <- qss(A, n=n, ...)
  sample_B <- qss(B, n=n, ...)
  sample_A + sample_B
}
ss <- sample_sum(A1, B1, n=100, from=0)
png("~/Desktop/answer.png", width=5, height=5, units="in", res=150)
plot(density(ss))
dev.off()
Note that I bounded the density plot at 0, because I'm assuming you don't want to infer negative energies. I see that the peak in the resultant density is just above 20, so 'it makes sense'.
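The "does it make sense" heuristic can also be checked exactly from the data, since the expected value of the sum of two independent draws is the sum of the two means:

```r
# Sketch: the sanity check computed exactly.
A1 <- c(1,2,3,3,3,4,4,5,6,7)
B1 <- c(11,13,15,17,17,18,18,19,20,22)
mean(A1) + mean(B1)  # 3.8 + 17 = 20.8, where the summed density should peak
```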
The potential advantage here is that you don't need to look at every possible combination of energies from the houses in the two cities to understand the distribution of summed energy uses. If you can define the distribution of both, you can define the distribution of paired sums.
Finally, the computation time is trivial, especially compared to the approach of finding all combinations. E.g., with 10 million houses in each city, the expand.grid approach gives an Error: cannot allocate vector of size 372529.0 Gb, whereas the sample_sum approach takes 0.12 seconds.
Of course, if the answer doesn't help you, the speed is worthless ;)
You probably want something like:
rowSums(expand.grid(A1, B1))
Using expand.grid will get you a data frame of all combinations of A1 and B1, and rowSums will add them.
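A quick sketch of that with the example data, normalizing to get the exact PMF of the sum over all 100 equally likely pairs:

```r
# Exact PMF of the sum of one random house from each city.
A1 <- c(1,2,3,3,3,4,4,5,6,7)
B1 <- c(11,13,15,17,17,18,18,19,20,22)
all_sums <- rowSums(expand.grid(A1, B1))    # all 100 combinations
pmf <- table(all_sums) / length(all_sums)   # probabilities summing to 1
pmf
```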