Why does the chisq.test function in R sort data in descending order before summing?
The code in question is:
```r
STATISTIC <- sum(sort((x - E)^2/E, decreasing = TRUE))
```
If I were worried about numerical stability under floating-point arithmetic and wanted an easy-to-deploy hack, I would sort the data in increasing order before summation. That way a tiny value is never added to an already large accumulator, which minimises the loss of the least significant bits of the result.
I looked into the source code of sum(), but it did not explain why the data should be passed to sum() in descending order. What am I missing?
An example:
```r
x = matrix(1.1, 10001, 1)
x[1] = 10^16  # a vector with 10000 copies of 1.1 and one 10^16
c(sum(sort(x, decreasing = TRUE)), sum(sort(x, decreasing = FALSE)))
```
The result:
```r
# [1] 10000000000010996 10000000000011000
```
When we sort the data in ascending order, we get the correct result. If we sort the data in descending order, the result is off by 4.
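The same effect can be reproduced outside R. Note that on many platforms R's sum() accumulates in a C long double, which softens but does not eliminate the ordering effect; a plain 64-bit double accumulator, as in this Python sketch (mine, not part of the question), shows a much larger gap for the descending order:

```python
import math

# 10000 copies of 1.1 plus one huge value, as in the R example
vals = [1.1] * 10000 + [1e16]

asc = sum(sorted(vals))                 # small values first
desc = sum(sorted(vals, reverse=True))  # 1e16 first: the ulp there is 2.0,
                                        # so each 1.1 is rounded up to 2.0

print(asc)              # 1.0000000000011e+16
print(desc)             # 1.000000000002e+16
print(math.fsum(vals))  # 1.0000000000011e+16  (correctly rounded reference)
```

Ascending order matches the correctly rounded sum, while descending order drifts by one ulp per addition.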
EDIT: The book "Accuracy and Stability of Numerical Algorithms" by Nicholas J. Higham states that
"when summing nonnegative numbers by recursive summation the increasing ordering is the best ordering, in the sense of having the smallest a priori forward error bound."
Thanks @Lamia for sharing the book in the comment section.
The book explains three methods of summation: recursive, insertion, and pairwise. Each technique has its own merits and demerits, reflected in the magnitude of its error bound, which can be derived by a systematic error analysis of floating-point summation.
Notably, the result of recursive summation depends on the ordering strategy: increasing, decreasing, or Psum (see page 82, 4th paragraph, and the worked example at the bottom of page 82).
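As an illustration of one of the alternatives Higham analyses, pairwise summation splits the vector in half, sums each half recursively, and adds the two partial sums; its error bound grows roughly like log2(n) rather than n. A minimal Python sketch (the function name is mine, not Higham's):

```python
import math

def pairwise_sum(xs):
    """Sum by recursive halving: error grows ~log2(n) instead of ~n."""
    if len(xs) <= 2:
        return sum(xs)
    mid = len(xs) // 2
    return pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:])

xs = [0.1] * 2**16
print(sum(xs))           # naive left-to-right sum accumulates error
print(pairwise_sum(xs))  # here this matches the correctly rounded sum
print(math.fsum(xs))     # correctly rounded reference
```

For equal summands every merge is an exact doubling, so pairwise summation lands exactly on the correctly rounded result, while naive recursive summation does not.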
Looking at the R source code for the sum() function, which can be found in summary.c, shows that R implements the recursive (left-to-right) method in its sum() function.
Also, the number of base-2 digits in the floating-point significand is 53, which can be obtained from

```r
.Machine$double.digits
# [1] 53
```
By setting this number as the precision in bits, we can compare the accuracy of the sum operation in base R and in mpfr() from the Rmpfr package under different ordering strategies. Notice that increasing order produces results closer to the ones seen in the precision-aware summations, which corroborates the statement from the book.
I computed the chi-square statistic using the raw data x:

```r
library('data.table')
library('Rmpfr')

x1 = matrix(c(10^16, rep(1.1, 10000)), nrow = 10001, ncol = 1)
df1 <- data.frame(x = x1)
setDT(df1)
df1[, p := rep(1/length(x), length(x))]
s_x <- df1[, sum(x)]
df1[, E := s_x * p]
df1[, chi := ((x - E)^2/E)]

precBits <- .Machine$double.digits

x_chi <- data.frame(
  names = c("x_asc", "x_desc", "x_fp_asc", "x_fp_desc",
            "chi_asc", "chi_desc", "chi_fp_asc", "chi_fp_desc"))

x_chi$vals <- c(
  ## x
  df1[order(x),  format(sum(x), digits = 22)],
  df1[order(-x), format(sum(x), digits = 22)],
  df1[order(x),  format(sum(mpfr(x, precBits = precBits)), digits = 22)],
  df1[order(-x), format(sum(mpfr(x, precBits = precBits)), digits = 22)],
  ## chi
  df1[order(chi),  format(sum(chi), digits = 22)],
  df1[order(-chi), format(sum(chi), digits = 22)],
  df1[order(chi),  format(sum(mpfr(chi, precBits = precBits)), digits = 22)],
  df1[order(-chi), format(sum(mpfr(chi, precBits = precBits)), digits = 22)])

x_chi
#         names                    vals
# 1       x_asc       10000000000011000
# 2      x_desc       10000000000010996
# 3    x_fp_asc 10000000000011000.00000
# 4   x_fp_desc 10000000000020000.00000
# 5     chi_asc    99999999999890014218
# 6    chi_desc    99999999999890030592
# 7  chi_fp_asc 99999999999890014208.00
# 8 chi_fp_desc 99999999999833554944.00
```
Looking at the source code via edit(chisq.test) shows that the input data themselves are never sorted; the only sort is the one applied to the terms (x - E)^2/E when the statistic is summed. Also, as pointed out in the comment section, the behaviour is not related to the sign (positive or negative) of the raw data passed to chisq.test(): the function does not accept negative values and halts with the error "all entries of 'x' must be nonnegative and finite".
```r
set.seed(2L)
chisq.test(c(rnorm(10, 0, 1)))
# Error in chisq.test(c(rnorm(10, 0, 1))) :
#   all entries of 'x' must be nonnegative and finite
```
The difference in the values when summing floating-point numbers comes from double-precision arithmetic. See the demo below. When the precision is raised to 200 bits using the mpfr() function from the Rmpfr package, the sum operation gives identical results regardless of the order of the vector x1 or x2. With plain double precision, however, the two orders give unequal values.
No FP Precision:

```r
x1 = matrix(c(10^16, rep(1.1, 10000)), nrow = 10001, ncol = 1)

## reverse
x2 = matrix(c(rep(1.1, 10000), 10^16), nrow = 10001, ncol = 1)

c(format(sum(x1), digits = 22), format(sum(x2), digits = 22))
# [1] "10000000000010996" "10000000000011000"
```
FP Precision maintained:

```r
library('Rmpfr')

prec <- 200
x1 = matrix(c(mpfr(10^16, precBits = prec),
              rep(mpfr(1.1, precBits = prec), 10000)),
            nrow = 10001, ncol = 1)

## reverse
x2 = matrix(c(rep(mpfr(1.1, precBits = prec), 10000),
              mpfr(10^16, precBits = prec)),
            nrow = 10001, ncol = 1)

c(sum(x1), sum(x2))
# 2 'mpfr' numbers of precision 200 bits
# [1] 10000000000011000.000000000000888178419700125232338905334472656
# [2] 10000000000011000.000000000000888178419700125232338905334472656
```
The machine epsilon of base R (the smallest relative difference that double precision can distinguish around 1) can be obtained from the code below. Any contribution whose relative magnitude falls below this threshold is rounded away, which is what produces the varying results of the sum operation.

```r
.Machine$double.eps
# [1] 2.220446e-16
```
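The same threshold exists in any IEEE 754 double implementation; a quick Python check (mine, not part of the original answer):

```python
import sys

eps = sys.float_info.epsilon   # 2^-52, spacing of doubles just above 1.0
print(eps)                     # 2.220446049250313e-16

assert 1.0 + eps > 1.0         # a relative bump of eps is representable
assert 1.0 + eps / 4 == 1.0    # below eps/2 the bump is rounded away
```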
Comparison of precision-aware and precision-unaware versions of the chisq.test() statistic:
The relevant portion of chisq.test() is extracted into a new function, chisq.test2(). Inside, you will see options for comparing the statistic before and after raising the precision to 250 bits with mpfr(). The precision-aware sums agree for both orderings to far more digits than the raw double-precision sums do.
```r
# modified chi-square function:
chisq.test2 <- function(x, precBits) {
  if (is.matrix(x)) {
    if (min(dim(x)) == 1L)
      x <- as.vector(x)
  }

  # before fp precision
  p = rep(1/length(x), length(x))
  n <- sum(x)
  E <- n * p

  # after fp precision
  x1 <- mpfr(x, precBits = precBits)
  p1 = rep(1/length(x1), length(x1))
  n1 <- sum(x1)
  E1 <- n1 * p1

  # chi-square statistic
  STATISTIC <- c(format(sum((x - E)^2/E), digits = 22),                           # before, decreasing
                 format(sum(sort((x - E)^2/E, decreasing = FALSE)), digits = 22), # before, increasing
                 sum((x1 - E1)^2/E1),                                             # after, decreasing
                 sum(sort((x1 - E1)^2/E1, decreasing = FALSE)))                   # after, increasing
  return(STATISTIC)
}

# data
x1 = matrix(c(10^16, rep(1.1, 10000)), nrow = 10001, ncol = 1)
chisq.test2(x = x1, precBits = 250)
Output:
```r
# [[1]]  before fp, decreasing
# [1] "99999999999890030592"
#
# [[2]]  before fp, increasing
# [1] "99999999999890014218"
#
# [[3]]  after fp, decreasing
# 'mpfr1' 99999999999889972569.502489584522352514811399898444554440067408531548230046685
#
# [[4]]  after fp, increasing
# 'mpfr1' 99999999999889972569.502489584522352514811399898444554440067408531548230251906
```
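If the goal is an order-insensitive sum without resorting to extended precision, Kahan compensated summation (also analysed in Higham's book) carries a running correction term that recovers the low-order bits lost at each step. A Python sketch for illustration, not part of chisq.test():

```python
import math

def kahan_sum(xs):
    """Compensated summation: error bound is independent of the order of xs."""
    total = 0.0
    comp = 0.0                  # running compensation for lost low-order bits
    for x in xs:
        y = x - comp            # re-inject what was lost on the previous step
        t = total + y
        comp = (t - total) - y  # the part of y that did not fit into t
        total = t
    return total

vals = [1e16] + [1.1] * 10000   # the problematic (descending) order
print(kahan_sum(vals))          # stays within a couple of ulps of fsum
print(math.fsum(vals))          # 1.0000000000011e+16
```

Unlike plain recursive summation, the compensated version needs no pre-sorting to stay accurate.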