Bootstrapping to compare two groups

Tags: r, statistics-bootstrap

In the following code I use bootstrapping to calculate the C.I. and the p-value under the null hypothesis that two different fertilizers applied to tomato plants have no effect on plant yields (the alternative being that the "improved" fertilizer is better). The first random sample (x) comes from plants given a standard fertilizer, while the second sample (y) comes from plants given an "improved" one.

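x <- c(11.4, 25.3, 29.9, 16.5, 21.1)        # standard fertilizer
y <- c(23.7, 26.6, 28.5, 14.2, 17.9, 24.3)  # "improved" fertilizer
total <- c(x, y)
library(boot)
# statistic: mean of positions 6:11 minus mean of positions 1:5 of the resample
# (note that this name masks base::diff)
diff <- function(x, i) mean(x[i[6:11]]) - mean(x[i[1:5]])
b <- boot(total, diff, R = 10000)

ci <- boot.ci(b)
p.value <- sum(b$t >= b$t0) / b$R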

What I don't like about the code above is that resampling is done as if there were only one sample of 11 values (treating the first 5 as belonging to sample x and the rest to sample y). Could you show me how this code should be modified to draw resamples of size 5 with replacement from the first sample and separate resamples of size 6 from the second sample, so that bootstrap resampling mimics the "separate samples" design that produced the original data?

237

asked Sep 01 '10 07:09

George Dontas

1 Answer

EDIT2: The hack has been deleted because it was a wrong solution. Instead, use the strata argument of the boot function:

total <- c(x,y)
id <- as.factor(c(rep("x",length(x)),rep("y",length(y))))
b <- boot(total, diff, strata=id, R = 10000)
...
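For readers who want to see what stratified resampling amounts to, here is a minimal base-R sketch that draws each group separately, without the boot package. The names boot_diff, obs_diff and perc_ci are made up for this illustration, the seed is arbitrary, and the interval is a plain percentile interval rather than anything boot.ci would report.

```r
# Manual stratified bootstrap: resample each group on its own (base R only).
x <- c(11.4, 25.3, 29.9, 16.5, 21.1)        # standard fertilizer
y <- c(23.7, 26.6, 28.5, 14.2, 17.9, 24.3)  # "improved" fertilizer

set.seed(1)   # arbitrary seed, only for reproducibility
R <- 10000

# Each replicate draws 5 values from x and, independently, 6 values from y,
# both with replacement, then takes the difference in means.
boot_diff <- replicate(R, {
  mean(sample(y, replace = TRUE)) - mean(sample(x, replace = TRUE))
})

obs_diff <- mean(y) - mean(x)                      # observed difference
perc_ci  <- quantile(boot_diff, c(0.025, 0.975))   # simple percentile interval
```

This should behave like the strata call above in spirit, since both keep the 5/6 split fixed in every replicate.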

Be aware that this will not get you anywhere near a correct estimate of the p-value:

x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)

total <- c(x,y)

b <- boot(total, diff, strata=id, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
> p.value
[1] 0.5162

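How would you explain a p-value of 0.51 for two samples where every value of the second is higher than the highest value of the first? The bootstrap replicates in b$t are centred on the observed difference b$t0, not on the null value of zero, so roughly half of them exceed b$t0 whatever the data look like.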

The code above is fine for getting a (biased) estimate of the confidence interval, but significance testing of the difference should be done by permutation over the complete dataset.
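A permutation test along those lines could be sketched as follows, in base R and using the extreme example data from above. The names perm_diff and p_value, the seed, and the small numerical tolerance are choices made for this illustration only. Group labels are shuffled without replacement over the pooled data, which is what the null hypothesis of "no difference" permits.

```r
# One-sided permutation test for mean(y) - mean(x), base R only.
x <- c(1.4, 2.3, 2.9, 1.5, 1.1)
y <- c(23.7, 26.6, 28.5, 14.2, 17.9, 24.3)
total <- c(x, y)
n_x <- length(x)

obs_diff <- mean(y) - mean(x)   # observed difference in means

set.seed(1)   # arbitrary seed, only for reproducibility
R <- 10000

# Each replicate shuffles the pooled values and re-splits them 5 / 6.
perm_diff <- replicate(R, {
  shuffled <- sample(total)     # permutation, i.e. without replacement
  mean(shuffled[-seq_len(n_x)]) - mean(shuffled[seq_len(n_x)])
})

# Proportion of permuted differences at least as large as the observed one.
# The tolerance guards against floating-point ties; the +1 terms count the
# observed arrangement itself, so the estimate can never be exactly zero.
p_value <- (sum(perm_diff >= obs_diff - 1e-12) + 1) / (R + 1)
```

For these data only one of the choose(11, 5) = 462 possible splits gives a difference as large as the observed one, so the estimate should land near 1/462, far from the 0.51 above.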

173

answered Sep 29 '22 23:09

Joris Meys
