Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to perform a bootstrap and find 95% confidence interval for the median of a dataset

I am working to perform a bootstrap using the statistic median for dataset "file", containing only one column "Total". This is it:

Total <-
c(2089, 1567, 1336, 1616, 1590, 1649, 1341, 1614, 1590, 1621, 
1621, 1631, 1295, 107, 18, 195, 2059, 870, 2371, 787, 98, 2422, 
655, 1277, 1336, 2109, 1811, 1337, 1290, 1308, 1359, 1600, 1296, 
693, 107, 1359, 89, 89, 89, 89, 2411, 1639, 89, 89, 1283, 89, 
89, 89, 2341, 1012, 1295, 1853, 1277, 1571, 1288, 1300, 1619, 
107, 555, 1612, 1300, 1300, 2093, 133, 1674, 988, 132, 647, 606, 
544, 873, 274, 120, 1620, 1601, 1601, 906, 1603, 1613, 1592, 
1603, 1610, 1321, 2380, 1575, 1575, 1277, 2354, 1561, 1579, 2367, 
2341, 876, 1612, 1588, 2087, 1612, 890, 1586, 1580, 611, 1797, 
2079, 1937, 189, 171, 706, 1647, 1642, 1278, 1650, 1623, 1647, 
1661, 1692, 1632, 1684, 2474, 403, 842, 593, 98, 2354, 1265, 
866, 1483, 2379, 1650, 1875, 1655, 1632, 1691, 1329, 867, 1632, 
1693, 1623, 829, 1659, 1685, 666, 1585, 1659, 2169, 1623, 1645, 
1654, 1698, 2172, 789, 1698, 579, 2443, 335, 132, 1952, 1265, 
978, 1624, 979, 1729, 607, 181, 752, 424, 386, 309, 998, 1435, 
2476, 392, 1657, 348, 1652, 1646, 1345, 2445, 1655, 840, 1624, 
1652, 1321, 1321, 2201, 957, 917, 2458, 4096, 2458, 1346, 2459, 
1634, 2459, 2459, 2459, 2508, 714, 2457, 2457, 1703, 669, 976, 
1634, 2459, 2491, 2393, 625, 1763, 879, 886, 1085, 731, 924, 
1649, 1216, 1647, 2470, 668, 2326, 757, 215, 276, 186, 901, 1402, 
429, 554, 2457, 1643, 986, 730, 1028, 971, 1952, 1584, 1023, 
1352, 839, 2434, 430, 2462, 1327, 1004, 385, 1099, 1067, 758, 
679, 1423, 2495, 1664, 2495, 2495, 1345, 2530, 1754, 1804, 2525, 
1652, 2536, 1646, 2529, 1380, 1845, 963, 1339, 2482, 1417, 1729, 
1384, 1648, 344, 1648, 955, 609, 485, 1822, 513, 223, 222, 193, 
1410, 1159, 586, 585, 2671, 2702, 2529, 2212, 1658, 741, 2529, 
861, 1758, 905, 2529, 597, 1049, 2529, 619, 2620, 2596, 1688, 
2590, 2545, 2590, 883, 287, 723, 2565, 1835, 1738, 2243, 1693, 
2565, 250, 2529, 1880, 1777, 701, 444, 927, 1127, 825, 2726, 
1977, 235, 241, 269, 660, 1523, 420, 678, 213, 544, 940, 983, 
605, 2716, 1848, 1848, 182, 1225, 365, 993, 224, 267, 309, 271, 
324, 178, 2657, 1772, 546, 456, 2637, 1771, 677, 1409, 653, 2359, 
690, 828, 2742, 1812, 2777, 552, 1572, 2742, 2792, 2819, 1753, 
265, 1901, 1753, 2716, 2800, 2742, 453, 2742, 586, 1920, 929, 
1897, 2742, 1859, 1899, 1106, 1135, 759, 730, 1838, 863, 1929, 
2751, 2751, 2751, 2751, 713, 430, 2788, 1784, 966, 2483, 1784, 
1786, 2727, 857, 1798, 1815, 730, 390, 593, 1489, 1448, 1784, 
1510, 2788, 812, 856, 808, 941, 2797, 2757, 1852, 2757, 2412, 
486, 1034, 615, 845, 974, 727, 969, 2916, 1841, 1926, 1926, 533, 
446, 733, 696, 1214, 1857, 1907, 2824, 2631, 3556, 2496, 1617, 
1000, 707, 936, 761, 960, 1936, 857, 423, 1130, 1165, 2453, 338, 
988, 1869, 1951, 1932, 2820, 2742, 628, 447, 866, 637, 932, 2742, 
1795, 2881, 695, 762, 2778, 427, 714, 2781, 1865, 1861, 678, 
1465, 1770, 845, 356, 817, 385, 1820, 2692, 1787, 1510, 1814, 
857, 2616, 204, 465, 1773, 2754, 1793, 1773, 1900, 185, 2706, 
1162, 766, 2742, 1816, 2742, 1790, 1803, 1795, 1026, 334, 832, 
478, 1849, 2679, 1773, 797, 2649, 1814, 1808, 99, 2037, 2616, 
2719, 1813, 2637, 2648, 1813, 865, 1717, 2588, 2711, 2818, 1828, 
2553, 2720, 1791, 1780, 2706, 2565, 1717, 1881, 1037, 329, 893, 
723, 1821, 2692, 2586, 2729, 1755, 1793, 2670, 2602, 2638, 2684, 
1813, 1755, 1755, 2626, 832, 739, 724, 1968, 2598, 2627, 851, 
749, 684, 625, 2673, 2778, 1764, 2644, 1800, 1792, 511, 2776, 
1890, 1764, 2776, 1040, 1049, 2699, 2061, 897, 1764, 274, 2755, 
1912, 2581, 1780, 820, 1803, 2692, 2783, 572, 2751, 2699, 1830, 
1875, 633, 1083)

Then I tried to use the bootstrap function:

> boot (Total, median, 1000)        

ORDINARY NONPARAMETRIC BOOTSTRAP

Call: boot(data = Total, statistic = median, R = 1000)

Bootstrap Statistics : original bias std. error t1* 1603 0 0 There were 50 or more warnings (use warnings() to see the first 50)

The warning message was: the condition has length > 1 and only the first element will be used

Can you please advise me how do I perform bootstrap to generate 95% confidence intervals for the median? I am a beginner in this and your help would be much appreciated.

Thank you so much in advance.

like image 446
ana57 Avatar asked Jul 23 '18 14:07

ana57


2 Answers

Admittedly the boot function from the boot package has a slightly non-intuitive aspect to it. But if you read the documentation (or look at the examples in the documentation) you'll see specific instructions about the statistic argument:

In all other cases statistic must take at least two arguments. The first argument passed will always be the original data. The second will be a vector of indices, frequencies or weights which define the bootstrap sample.

So instead of:

x <- rnorm(10)
boot(data = x,statistic = median,R = 1000)

You want this:

boot(data = x,statistic = function(x,i) median(x[i]),R = 1000)

Once you're that far, the function boot.ci() can be used to compute the confidence intervals (only some of them are available in this particular example I believe).

b <- boot(data = x,statistic = function(x,i) median(x[i]),R = 1000)
boot.ci(b)
like image 102
joran Avatar answered Sep 28 '22 18:09

joran


Though the answer by @joran is right, since I already had code tested, with the CI computation, here it goes.

library(boot)

bootMedian <- function(data, indices) median(data[indices])

b <- boot(Total, bootMedian, R = 1000)

boot.ci(b)
like image 40
Rui Barradas Avatar answered Sep 28 '22 17:09

Rui Barradas