Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

logspline produces incosistent results

Tags:

r

I need my R scripts to always produce the same results for the same input. However, I've noticed that the function logspline from the package logspline produces different output on windows OS and ubuntu OS. Both uses logspline package 2.1.16, though.

Here's the code:

x <- c(0.3453205379,0.3497529927,0.3460179029,0.3433414591,0.3565925053,0.3318019585,0.3322091870,0.3314076990,0.3413768315,0.3305650805,0.3342775671,0.3362692445,0.3321054345,0.5982416984,0.5509602046,0.6600000000,0.3339818725,0.3307459063,0.3314632807,0.3356930476,0.3300000000,0.3324504116,0.3470739551,0.3441385006,0.3316070520,0.3399635743,0.3316989471,0.3308044524,0.3536822479,0.3315414656)
fnc <- logspline(x)
plot.logspline(fnc)

and here's the result on windows - fitted with 7 knots: enter image description here and here the result on ubuntu - fitted with 5 knots: enter image description here

Could someone explain me the reason of the difference, please? And is there a way to force the function to produce consistent results in any environment?

Output logspline(x, error.action=0) on windows:

 knots A(1)/D(2) loglik     AIC minimum penalty maximum penalty
     4         2  40.53  -70.85          113.83             Inf
     5         2  97.44 -181.28            5.48          113.83
     6         2  99.48 -181.96              NA              NA
     7         2 102.92 -185.44           -0.13            5.48
     8         1 102.86 -181.90              NA              NA
the present optimal number of knots is  7 
penalty(AIC) was the default: BIC=log(samplesize): log( 30 )= 3.4 

Output logspline(x, error.action=0) on linux:

 knots A(1)/D(2) loglik     AIC minimum penalty maximum penalty
     4         2  87.33 -164.46            8.38             Inf
     5         2  91.52 -169.44            2.07            8.38
     6         1  91.95 -166.88              NA              NA
     7         1  93.60 -166.78            0.00            2.07
     8         1  93.60 -163.38              NA              NA
the present optimal number of knots is  5 
penalty(AIC) was the default: BIC=log(samplesize): log( 30 )= 3.4

Output sessionInfo() on windows:

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] logspline_2.1.16

loaded via a namespace (and not attached):
[1] compiler_4.1.0 tools_4.1.0  

Output sessionInfo() on linux:

R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8    
 [5] LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C             
 [9] LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] logspline_2.1.16

loaded via a namespace (and not attached):
[1] compiler_4.1.0 tools_4.1.0 
like image 457
dpelisek Avatar asked May 12 '21 06:05

dpelisek


2 Answers

Not a complete answer, but this was way too long for a comment.

I tried different systems and this is what I found:

  • docker: rocker/r-base:4.1.0 returns 5 knots (gcc and gfortran version 10.2.1 LAPACK openblas-pthread/libopenblasp-r0.3.13)
  • docker: rocker/r-base:3.6.3 returns 5 knots (gcc and gfortran version 9.2.1-30 LAPACK lapack/liblapack.so.3.9.0)
  • Fresh docker install Ubuntu 21.04 (alt 20.04) with 4.1.0 returns 5 knots (gcc and gfortran version 10.3.0 (9.3.0 for ubuntu:20.04) LAPACK lapack/liblapack.so.3.9.0)
  • Windows 4.1.0 returns 7 knots (not sure which gcc, gfortran, LAPACK version was used)
  • Ubuntu 18.04 (docker) ubuntu:18.04 with R 4.1.0 returns 7 knots (gcc and gfortran version 7.5.0, LAPACK version liblapack.so.3.7.1, alternatively tested with /atlas/liblapack.so.3.10.3, same results)
  • Ubuntu 18.04 with 4.0.2 returns 7 knots (gcc and gfortran version 7.5.0 LAPACK libopenblasp-r0.2.20)

The package logspline uses multiple fortran and C functions, one of them being nlogcensor. The code below calculates many things, among others the nknots.

MWE for the internal code in logspline::logspline (line 483ff of R/logspline.R in version 0.2.8 (github), for the CRAN version (current 0.2.16) it is line 486ff).

logspline() calls in nlogcensor() a C function which is defined in src/nlsd.c at line 166ff (latest version line 167ff).

library(logspline)
intpars <- c(30L, 0L, 0L, 0L, 1L, 1L, -1L)
data <- c(0.33, 0.3305650805, 0.3307459063, 0.3308044524, 0.331407699, 
          0.3314632807, 0.3315414656, 0.331607052, 0.3316989471, 0.3318019585, 
          0.3321054345, 0.332209187, 0.3324504116, 0.3339818725, 0.3342775671, 
          0.3356930476, 0.3362692445, 0.3399635743, 0.3413768315, 0.3433414591, 
          0.3441385006, 0.3453205379, 0.3460179029, 0.3470739551, 0.3497529927, 
          0.3536822479, 0.3565925053, 0.5509602046, 0.5982416984, 0.66, 
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
          0, 0)
dpars <- c(-1, 0, 0)
maxp <- 65
kts <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
         0, 0, 0)

# actual code
z <- .C("nlogcensor",
        ip = as.integer(intpars),
        coef = as.double(data),
        dp = as.double(dpars),
        logl = as.double(rep(0, maxp)),
        ad = as.integer(rep(0, maxp)),
        kts = as.double(kts),
        PACKAGE = "logspline")

z$ip[2]
#> 7 some older systems, 5 for newer systems

This narrows down the source of error. And this is where my current journey ends. Unfortunately, I don't have much C experience and no FORTRAN experience.

I tried to follow the C routines, but the code is little formatted and hard to read for me. In line 328 of nlsd.c (version 0.2.16 line 335), the final nknots is assigned as intpars[1]=(*spc).nk; (C is zero indexed, R is 1 indexed).

Maybe you can reprodue the error using the nlogcensor code call and confirm different different Fortran or LAPACK toolchains.

Answer Attempt

As can be seen from the text above, there is no simple way to make the package fully reproducible. That is, set.seed(123) and controlling the version of R and the package itself does not help as I suspect the toolchain version is different and leads to the differences.

The results from the different tests above suggest, that if you have a liblapack version >= 3.9.0 (or libopenblasp >= r0.3.13), you get 5 knots. Older versions lead to 7 knots.

As you usually have little control over this, you might want to look into docker and use a specified version, e.g., rocker/r-base:4.1.0. This might be a bit of an overkill for this problem (not sure how or for what you use the code), but this is the only way I currently see you can achieve full reproducibility as you have control over the toolchain as well.

Hope that clears your problem a bit or points you to a potential solution.

Docker Scripts

In case you want to replicate the results:

For the rocker/r-base images, just use docker run --rm -it rocker/r-base:4.1.0 and then within R install.packages("logspline") and the MWE code from above.

The full install script for ubuntu is a bit more complicated, see below:

docker run --rm -it ubuntu:18.04

# within the container

# install dependencies
apt update && apt upgrade && \
  apt install --no-install-recommends -y software-properties-common dirmngr wget build-essential gfortran 

# install r-base
wget -qO- https://cran.rstudio.com/bin/linux/ubuntu/marutter_pubkey.asc | \
  tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc && \
  add-apt-repository "deb https://cran.rstudio.com/bin/linux/ubuntu $(lsb_release -cs)-cran40/" && \
  apt install --no-install-recommends -y r-base

# install lapack and blas
apt install libblas-dev liblapack-dev
# alternatively use this: libatlas-base-dev 

# install logspline
R -e "install.packages('logspline', repos='https://cran.rstudio.com')"

# see versions
gcc --version
gfortran --version
R -e "sessionInfo()" | grep usr

# run R
R

and then use the code from the MWE above.

like image 147
David Avatar answered Oct 18 '22 10:10

David


The number of knots should be the same across machines to test if there is a difference. You must keep the number of knots the same because this will also change the statistical inference as well.

x <- c(0.3453205379,0.3497529927,0.3460179029,0.3433414591,0.3565925053,0.3318019585,0.3322091870,0.3314076990,0.3413768315,0.3305650805,0.3342775671,0.3362692445,0.3321054345,0.5982416984,0.5509602046,0.6600000000,0.3339818725,0.3307459063,0.3314632807,0.3356930476,0.3300000000,0.3324504116,0.3470739551,0.3441385006,0.3316070520,0.3399635743,0.3316989471,0.3308044524,0.3536822479,0.3315414656)
fnc1 <- logspline(x, nknots = 5)
plot.logspline(fnc1)

nknots 5

fnc2 <- logspline(x, nknots = 7)
plot.logspline(fnc2)

nknots 7

See I am testing only on windows but when you change the number of knots the graphics change.

like image 34
Mike Avatar answered Oct 18 '22 08:10

Mike