Why does the chisq.test function in R sort data in descending order before summing?
The code in question is:
```r
STATISTIC <- sum(sort((x - E)^2/E, decreasing = TRUE))
```
If I were worried about numerical stability under floating-point arithmetic and wanted an easy-to-deploy hack, I would sort the data in increasing order before summation. That way a tiny value is never added to an already large accumulator, which minimises the loss of the least significant bits of the result.
I looked into the source code of sum(), but it did not explain why the data should be passed to sum() in descending order. What am I missing?
An example:
```r
x = matrix(1.1, 10001, 1)
x[1] = 10^16  # a vector with 10000 copies of 1.1 and one 10^16
c(sum(sort(x, decreasing = TRUE)), sum(sort(x, decreasing = FALSE)))
```
The result:
```r
# [1] 10000000000010996 10000000000011000
```
When we sort the data in ascending order, we get the correct result. If we sort the data in descending order, the result is off by 4.
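The same effect can be reproduced outside R. Note that on many platforms R's sum() accumulates in a C long double, which softens but does not eliminate the ordering effect; a plain 64-bit double accumulator, as in this Python sketch (mine, not part of the question), shows a much larger gap for the descending order:

```python
import math

# 10000 copies of 1.1 plus one huge value, as in the R example
vals = [1.1] * 10000 + [1e16]

asc = sum(sorted(vals))                 # small values first
desc = sum(sorted(vals, reverse=True))  # 1e16 first: the ulp there is 2.0,
                                        # so each 1.1 is rounded up to 2.0

print(asc)              # 1.0000000000011e+16
print(desc)             # 1.000000000002e+16
print(math.fsum(vals))  # 1.0000000000011e+16  (correctly rounded reference)
```

Ascending order matches the correctly rounded sum, while descending order drifts by one ulp per addition.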
EDIT: The book "Accuracy and Stability of Numerical Algorithms" by Nicholas J. Higham states that
"when summing nonnegative numbers by recursive summation the increasing ordering is the best ordering, in the sense of having the smallest a priori forward error bound."
Thanks @Lamia for sharing the book in the comment section.
The book explains three methods of summation: recursive, insertion, and pairwise. Each technique has its own merits and demerits, reflected in the magnitude of its error bound, which can be derived by a systematic error analysis of floating-point summation.
Notably, the result of recursive summation depends on the ordering strategy: increasing, decreasing, or Psum (see page 82, 4th paragraph, and the worked example at the bottom of page 82).
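As an illustration of one of the alternatives Higham analyses, pairwise summation splits the vector in half, sums each half recursively, and adds the two partial sums; its error bound grows roughly like log2(n) rather than n. A minimal Python sketch (the function name is mine, not Higham's):

```python
import math

def pairwise_sum(xs):
    """Sum by recursive halving: error grows ~log2(n) instead of ~n."""
    if len(xs) <= 2:
        return sum(xs)
    mid = len(xs) // 2
    return pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:])

xs = [0.1] * 2**16
print(sum(xs))           # naive left-to-right sum accumulates error
print(pairwise_sum(xs))  # here this matches the correctly rounded sum
print(math.fsum(xs))     # correctly rounded reference
```

For equal summands every merge is an exact doubling, so pairwise summation lands exactly on the correctly rounded result, while naive recursive summation does not.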
Looking at the R source code for the sum() function, which can be found in summary.c, shows that R implements the recursive (left-to-right) method in its sum() function.
Also, the number of base-2 digits in the floating-point significand is 53, which can be obtained from

```r
.Machine$double.digits
# [1] 53
```
By setting this number as the precision in bits, we can compare the accuracy of the sum operation in base R and in mpfr() from the Rmpfr package under different ordering strategies. Notice that increasing order produces results closer to the ones seen in the precision-aware summations, which corroborates the statement from the book.
I computed the chi-square statistic using the raw data x:

```r
library('data.table')
library('Rmpfr')

x1 = matrix(c(10^16, rep(1.1, 10000)), nrow = 10001, ncol = 1)
df1 <- data.frame(x = x1)
setDT(df1)
df1[, p := rep(1/length(x), length(x))]
s_x <- df1[, sum(x)]
df1[, E := s_x * p]
df1[, chi := ((x - E)^2/E)]

precBits <- .Machine$double.digits

x_chi <- data.frame(
  names = c("x_asc", "x_desc", "x_fp_asc", "x_fp_desc",
            "chi_asc", "chi_desc", "chi_fp_asc", "chi_fp_desc"))

x_chi$vals <- c(
  ## x
  df1[order(x),  format(sum(x), digits = 22)],
  df1[order(-x), format(sum(x), digits = 22)],
  df1[order(x),  format(sum(mpfr(x, precBits = precBits)), digits = 22)],
  df1[order(-x), format(sum(mpfr(x, precBits = precBits)), digits = 22)],
  ## chi
  df1[order(chi),  format(sum(chi), digits = 22)],
  df1[order(-chi), format(sum(chi), digits = 22)],
  df1[order(chi),  format(sum(mpfr(chi, precBits = precBits)), digits = 22)],
  df1[order(-chi), format(sum(mpfr(chi, precBits = precBits)), digits = 22)])

x_chi
#         names                    vals
# 1       x_asc       10000000000011000
# 2      x_desc       10000000000010996
# 3    x_fp_asc 10000000000011000.00000
# 4   x_fp_desc 10000000000020000.00000
# 5     chi_asc    99999999999890014218
# 6    chi_desc    99999999999890030592
# 7  chi_fp_asc 99999999999890014208.00
# 8 chi_fp_desc 99999999999833554944.00
```
Looking at the source code via edit(chisq.test) shows that the input data themselves are never sorted; the only sort is the one applied to the terms (x - E)^2/E when the statistic is summed. Also, as pointed out in the comment section, the behaviour is not related to the sign (positive or negative) of the raw data passed to chisq.test(): the function does not accept negative values and halts with the error "all entries of 'x' must be nonnegative and finite".
```r
set.seed(2L)
chisq.test(c(rnorm(10, 0, 1)))
# Error in chisq.test(c(rnorm(10, 0, 1))) :
#   all entries of 'x' must be nonnegative and finite
```
The difference in the values when summing floating-point numbers comes from double-precision arithmetic. See the demo below. When the precision is raised to 200 bits using the mpfr() function from the Rmpfr package, the sum operation gives identical results regardless of the order of the vector x1 or x2. With plain double precision, however, the two orders give unequal values.
No FP Precision:

```r
x1 = matrix(c(10^16, rep(1.1, 10000)), nrow = 10001, ncol = 1)

## reverse
x2 = matrix(c(rep(1.1, 10000), 10^16), nrow = 10001, ncol = 1)

c(format(sum(x1), digits = 22), format(sum(x2), digits = 22))
# [1] "10000000000010996" "10000000000011000"
```
FP Precision maintained:

```r
library('Rmpfr')

prec <- 200
x1 = matrix(c(mpfr(10^16, precBits = prec),
              rep(mpfr(1.1, precBits = prec), 10000)),
            nrow = 10001, ncol = 1)

## reverse
x2 = matrix(c(rep(mpfr(1.1, precBits = prec), 10000),
              mpfr(10^16, precBits = prec)),
            nrow = 10001, ncol = 1)

c(sum(x1), sum(x2))
# 2 'mpfr' numbers of precision 200 bits
# [1] 10000000000011000.000000000000888178419700125232338905334472656
# [2] 10000000000011000.000000000000888178419700125232338905334472656
```
The machine epsilon of base R (the smallest relative difference that double precision can distinguish around 1) can be obtained from the code below. Any contribution whose relative magnitude falls below this threshold is rounded away, which is what produces the varying results of the sum operation.

```r
.Machine$double.eps
# [1] 2.220446e-16
```
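The same threshold exists in any IEEE 754 double implementation; a quick Python check (mine, not part of the original answer):

```python
import sys

eps = sys.float_info.epsilon   # 2^-52, spacing of doubles just above 1.0
print(eps)                     # 2.220446049250313e-16

assert 1.0 + eps > 1.0         # a relative bump of eps is representable
assert 1.0 + eps / 4 == 1.0    # below eps/2 the bump is rounded away
```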
Comparison of precision-aware and precision-unaware versions of the chisq.test() statistic:
The relevant portion of chisq.test() is extracted into a new function, chisq.test2(). Inside, you will see options for comparing the statistic before and after raising the precision to 250 bits with mpfr(). The precision-aware sums agree for both orderings to far more digits than the raw double-precision sums do.
```r
# modified chi-square function:
chisq.test2 <- function(x, precBits) {
  if (is.matrix(x)) {
    if (min(dim(x)) == 1L)
      x <- as.vector(x)
  }

  # before fp precision
  p = rep(1/length(x), length(x))
  n <- sum(x)
  E <- n * p

  # after fp precision
  x1 <- mpfr(x, precBits = precBits)
  p1 = rep(1/length(x1), length(x1))
  n1 <- sum(x1)
  E1 <- n1 * p1

  # chi-square statistic
  STATISTIC <- c(format(sum((x - E)^2/E), digits = 22),                           # before, decreasing
                 format(sum(sort((x - E)^2/E, decreasing = FALSE)), digits = 22), # before, increasing
                 sum((x1 - E1)^2/E1),                                             # after, decreasing
                 sum(sort((x1 - E1)^2/E1, decreasing = FALSE)))                   # after, increasing
  return(STATISTIC)
}

# data
x1 = matrix(c(10^16, rep(1.1, 10000)), nrow = 10001, ncol = 1)
chisq.test2(x = x1, precBits = 250)
Output:
```r
# [[1]]  before fp, decreasing
# [1] "99999999999890030592"
#
# [[2]]  before fp, increasing
# [1] "99999999999890014218"
#
# [[3]]  after fp, decreasing
# 'mpfr1' 99999999999889972569.502489584522352514811399898444554440067408531548230046685
#
# [[4]]  after fp, increasing
# 'mpfr1' 99999999999889972569.502489584522352514811399898444554440067408531548230251906
```
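If the goal is an order-insensitive sum without resorting to extended precision, Kahan compensated summation (also analysed in Higham's book) carries a running correction term that recovers the low-order bits lost at each step. A Python sketch for illustration, not part of chisq.test():

```python
import math

def kahan_sum(xs):
    """Compensated summation: error bound is independent of the order of xs."""
    total = 0.0
    comp = 0.0                  # running compensation for lost low-order bits
    for x in xs:
        y = x - comp            # re-inject what was lost on the previous step
        t = total + y
        comp = (t - total) - y  # the part of y that did not fit into t
        total = t
    return total

vals = [1e16] + [1.1] * 10000   # the problematic (descending) order
print(kahan_sum(vals))          # stays within a couple of ulps of fsum
print(math.fsum(vals))          # 1.0000000000011e+16
```

Unlike plain recursive summation, the compensated version needs no pre-sorting to stay accurate.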