Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does chisq.test sort data in descending order before summation

Tags:

Why does chisq.test function in R sorts data before summation in descending order?

The code in question is:

STATISTIC <- sum(sort((x - E)^2/E, decreasing = TRUE)) 

If I was worried about numerical stability due to usage of float arithmetic and wanted to use some easy to deploy hack, I would sort the data in increasing order before summation to avoid adding a tiny value to large value in the accumulator (in order to avoid trimming of the least significant bits in the result as much as possible).

I looked into the source code of sum but it did not explain why to pass data in descending order to sum(). What do I miss?

An example:

x = matrix(1.1, 10001, 1) x[1] = 10^16   # We have a vector with 10000*1.1 and 1*10^16 c(sum(sort(x, decreasing = TRUE)), sum(sort(x, decreasing = FALSE))) 

The result:

10000000000010996 10000000000011000 

When we sort the data in the ascending order, we get the correct result. If we sort the data in the descending order, we get a result that is off by 4.

like image 894
user824276 Avatar asked Dec 16 '17 15:12

user824276


People also ask

When would you want to sort in descending sequence?

If you want the data arranged in reverse alphabetical order, then specify descending order. where DESC indicates the rows are to be arranged in descending order.

Can data be sorted in descending sequence?

Data is typically sorted based on actual values, counts or percentages, in either ascending or descending order, but can also be sorted based on the variable value labels.

Which tool is used to arrange data in ascending descending order?

Sorting is one of the most common tools for data management. In Excel, you can sort your table by one or more columns, by ascending or descending order, or do a custom sort.

Is the value at the middle when the data is stored in descending order?

The median is the middle value in a series of numbers which is first sorted in either ascending or descending order. When there are extremes in the series that could affect the mean of the numbers, the median can be utilised instead of the mean.


1 Answers

EDIT: The book "Accuracy and stability of numerical algorithms by Nicolas J. Higham" states that

"when summing nonnegative numbers by recursive summation the increasing ordering is the best ordering, in the sense of having the smallest a priori forward error bound."

Thanks @Lamia for sharing the book in the comment section.

This books explains three methods of summation such as recursive, insertion and pairwise techniques. Each technique has its own merits and demerits based on the magnitude of the error bound associated with them, which can be computed by doing systematic error analysis of summation of floating point numbers.

Notably, the summation results from recursive technique is dependent on the ordering strategies such as increasing, decreasing and Psum (look in the book - page 82 - 4th paragraph. also see how it works in the example given at the bottom of page 82.).

Looking at the R source code for the sum() function which can be obtained from summary.c informs that R implements recursive method in its sum() function.

Also the number of base digits in the floating-point significand is 53, which can be obtained from

.Machine$double.digits # [1] 53 

By setting this number as the precision bits, we can compare the accuracy of sum operation by base R and mpfr() from Rmpfr library for different ordering strategies. Notice that increasing order produces results closer to the ones seen in floating point aware summations, which corroborates the above statement from this book.

I computed chi square statistic using the raw data x.

library('data.table') library('Rmpfr') x1 = matrix(c( 10^16, rep(1.1, 10000)),              nrow = 10001, ncol = 1) df1 <- data.frame(x = x1) setDT(df1) df1[, p := rep(1/length(x), length(x))] s_x <- df1[, sum(x)] df1[, E := s_x * p] df1[, chi := ((x - E)^2/E)]  precBits <- .Machine$double.digits x_chi <- data.frame( names = c("x_asc", "x_desc", "x_fp_asc", "x_fp_desc",                                "chi_asc", "chi_desc", "chi_fp_asc", "chi_fp_desc")) x_chi$vals <- c( ## x   df1[order(x), format( sum(x), digits = 22)],   df1[order(-x), format( sum(x), digits = 22)],   df1[order(x), format( sum(mpfr(x, precBits = precBits)), digits = 22)],   df1[order(-x), format( sum(mpfr(x, precBits = precBits)), digits = 22)],   ## chi   df1[order(chi), format( sum(chi), digits = 22)],   df1[order(-chi), format( sum(chi), digits = 22)],   df1[order(chi), format( sum(mpfr(chi, precBits = precBits)), digits = 22)],   df1[order(-chi), format( sum(mpfr(chi, precBits = precBits)), digits = 22)])  x_chi #         names                    vals # 1       x_asc       10000000000011000 # 2      x_desc       10000000000010996 # 3    x_fp_asc 10000000000011000.00000 # 4   x_fp_desc 10000000000020000.00000 # 5     chi_asc    99999999999890014218 # 6    chi_desc    99999999999890030592 # 7  chi_fp_asc 99999999999890014208.00 # 8 chi_fp_desc 99999999999833554944.00 

Looking at the source code of edit(chisq.test) function informs that there is no sorting operation involved in it.

Also, as pointed out in the comment section, it is not related to the sign (+ve or -ve) of values of raw data used in chisq.test() function. This function does not accept negative values, so it will spit error by halting the function with this message "all entries of 'x' must be nonnegative and finite".

set.seed(2L) chisq.test(c(rnorm(10, 0, 1))) # Error in chisq.test(c(rnorm(10, 0, 1))) :  #   all entries of 'x' must be nonnegative and finite 

The difference in the values while summing floating point numbers is concerned with the double precision arithmetic. See the demo below. When the precision of floating point numbers is maintained at 200 digits using the mpfr() function available in Rmpfr package, the sum operation gives identical results regardless of the order of the vector x1 or x2. However, unequal values are observed when no floating point precision is maintained.

No FP Precision:

x1 = matrix(c( 10^16, rep(1.1, 10000)),              nrow = 10001, ncol = 1) ## reverse x2 = matrix(c( rep(1.1, 10000), 10^16 ),              nrow = 10001, ncol = 1)  c( format(sum(x1), digits = 22),     format(sum(x2), digits = 22)) # [1] "10000000000010996" "10000000000011000" 

FP Precision maintained:

library('Rmpfr') ## prec <- 200 x1 = matrix(c( mpfr( 10^16, precBits = prec),               rep( mpfr(1.1, precBits = prec), 10000)),             nrow = 10001, ncol = 1)  ## reverse x2 = matrix(c( rep(mpfr(1.1, precBits = prec), 10000),                mpfr( 10^16, precBits = prec) ),             nrow = 10001, ncol = 1) c( sum(x1), sum(x2)) # 2 'mpfr' numbers of precision  200   bits  # [1] 10000000000011000.000000000000888178419700125232338905334472656 # [2] 10000000000011000.000000000000888178419700125232338905334472656 

The smallest positive floating point number in base R can be obtained from the code below and any number smaller than this number will be truncated in base R, which produces varying results out of sum operation.

.Machine$double.eps # [1] 2.220446e-16 

Comparison of double precision arithmetic aware and unaware functions of chisq.test() function.

The relevant portion of chisq.test() is extracted and a new function chisq.test2() is made with it. Inside, you will see options for comparison before and after applying 250 digits double precision awareness using mpfr() function to chi square statistic. You can see identical results are seen for floating point aware function, but not for the raw data.

# modified chi square function: chisq.test2 <- function (x, precBits)  {   if (is.matrix(x)) {     if (min(dim(x)) == 1L)        x <- as.vector(x)   }    #before fp precision   p = rep(1/length(x), length(x))   n <- sum(x)   E <- n * p    # after fp precision   x1 <- mpfr(x, precBits = precBits)   p1 = rep(1/length(x1), length(x1))   n1 <- sum(x1)   E1 <- n1 * p1    # chisquare statistic   STATISTIC <- c(format(sum((x - E)^2/E), digits=22),           # before decreasing                  format(sum(sort((x - E)^2/E, decreasing = FALSE)), digits=22), # before increasing                  sum((x1 - E1)^2/E1),                           # after decreasing                   sum(sort((x1 - E1)^2/E1, decreasing = FALSE))) # after increasing    return(STATISTIC) }  # data x1 = matrix(c( 10^16, rep(1.1, 10000)),              nrow = 10001, ncol = 1)  chisq.test2(x = x1, precBits=250) 

Output:

# [[1]]  # before fp decreasing # [1] "99999999999890030592" #  # [[2]]  # before fp increasing # [1] "99999999999890014218" #  # [[3]]  # after fp decreasing  # 'mpfr1' 99999999999889972569.502489584522352514811399898444554440067408531548230046685 #  # [[4]]  # after fp increasing # 'mpfr1' 99999999999889972569.502489584522352514811399898444554440067408531548230251906 
like image 105
Sathish Avatar answered Sep 19 '22 13:09

Sathish