What does read_csv() use random numbers for?

Tags:

r

readr

I just noticed that read_csv() somehow uses random numbers, which is unexpected (at least to me). The corresponding base R function, read.csv(), does not. So what does read_csv() use random numbers for? I looked through the documentation but could not find a clear answer. Are the random numbers related to the guess_max argument?

library(tidyverse)
set.seed(123)
rnorm(1)
# [1] -0.5604756

set.seed(123)
dat <- read.csv("data/titanic.csv")
rnorm(1)
# [1] -0.5604756

set.seed(123)
dat <- read_csv("data/titanic.csv")
rnorm(1)
# [1] 1.239496

EDIT:

1. As suggested in rawr's comment, I tried specifying col_types, and it did indeed work. But I still wonder why this happens. Does anyone have an explanation?

set.seed(123)
dat <- read_csv("data/titanic.csv", col_types = c("dddccdddcdcc"))
rnorm(1)
# [1] -0.5604756

2. Since a lot of people asked about the R and readr versions, here is my session info.

library(readr)
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 4.0.5 (2021-03-31)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  German_Germany.1252         
#>  ctype    German_Germany.1252         
#>  tz       Europe/Berlin               
#>  date     2021-06-10                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version     date       lib source                            
#>  cli           2.5.0       2021-04-26 [1] CRAN (R 4.0.3)                    
#>  crayon        1.4.1       2021-02-08 [1] CRAN (R 4.0.4)                    
#>  digest        0.6.27      2020-10-24 [1] CRAN (R 4.0.3)                    
#>  ellipsis      0.3.2       2021-04-29 [1] CRAN (R 4.0.3)                    
#>  evaluate      0.14        2019-05-28 [1] CRAN (R 4.0.3)                    
#>  fansi         0.5.0       2021-05-25 [1] CRAN (R 4.0.5)                    
#>  fastmap       1.1.0       2021-01-25 [1] CRAN (R 4.0.5)                    
#>  fs            1.5.0       2020-07-31 [1] CRAN (R 4.0.3)                    
#>  glue          1.4.2       2020-08-27 [1] CRAN (R 4.0.3)                    
#>  highr         0.9         2021-04-16 [1] CRAN (R 4.0.5)                    
#>  hms           1.0.0       2021-01-13 [1] CRAN (R 4.0.5)                    
#>  htmltools     0.5.1.9003  2021-05-07 [1] Github (rstudio/htmltools@e12171e)
#>  knitr         1.33        2021-04-24 [1] CRAN (R 4.0.5)                    
#>  lifecycle     1.0.0       2021-02-15 [1] CRAN (R 4.0.4)                    
#>  magrittr      2.0.1       2020-11-17 [1] CRAN (R 4.0.3)                    
#>  pillar        1.6.1       2021-05-16 [1] CRAN (R 4.0.5)                    
#>  pkgconfig     2.0.3       2019-09-22 [1] CRAN (R 4.0.3)                    
#>  ps            1.6.0       2021-02-28 [1] CRAN (R 4.0.5)                    
#>  R6            2.5.0       2020-10-28 [1] CRAN (R 4.0.3)                    
#>  readr       * 1.4.0       2020-10-05 [1] CRAN (R 4.0.5)                    
#>  reprex        2.0.0       2021-04-02 [1] CRAN (R 4.0.5)                    
#>  rlang         0.4.11.9000 2021-05-29 [1] Github (r-lib/rlang@7797cdf)      
#>  rmarkdown     2.8.1       2021-05-07 [1] Github (rstudio/rmarkdown@e98207f)
#>  rstudioapi    0.13        2020-11-12 [1] CRAN (R 4.0.3)                    
#>  sessioninfo   1.1.1       2018-11-05 [1] CRAN (R 4.0.3)                    
#>  stringi       1.5.3       2020-09-09 [1] CRAN (R 4.0.3)                    
#>  stringr       1.4.0       2019-02-10 [1] CRAN (R 4.0.3)                    
#>  tibble        3.1.2       2021-05-16 [1] CRAN (R 4.0.5)                    
#>  utf8          1.2.1       2021-03-12 [1] CRAN (R 4.0.3)                    
#>  vctrs         0.3.8       2021-04-29 [1] CRAN (R 4.0.3)                    
#>  withr         2.4.2       2021-04-18 [1] CRAN (R 4.0.5)                    
#>  xfun          0.22        2021-03-11 [1] CRAN (R 4.0.5)                    
#>  yaml          2.2.1       2020-02-01 [1] CRAN (R 4.0.3)                    
#> 
#> [1] C:/Users/Albert/Documents/R/win-library/4.0
#> [2] C:/Program Files/R/R-4.0.5/library

Created on 2021-06-10 by the reprex package (v2.0.0)


asked Jun 09 '21 17:06

AlbertRapp

1 Answer

tl;dr: somewhere deep in the guts of the cli package (called to generate the pretty-printed output about column types), the code generates a random string to use as a label.

A major clue is that

set.seed(123); dat <- read_csv("iris.csv", col_types=cols()); rnorm(1)

runs read_csv guessing the column types but without printing information about the guesses; this doesn't hit the RNG, which makes me think it's something in the fancy colour printing.

By saving a copy of the RNG state (R <- .Random.seed), stepping through the code (debug(readr:::show_cols_spec)), and periodically running identical(R, .Random.seed) to check whether the state had changed, I found that it changes after running

cli::cli_h1("Column specification")

Debugging into that function, the change occurs somewhere in cli:::cli__message; specifically, right before we execute this line

 if ("id" %in% names(args) && is.null(args$id)) args$id <- new_uuid()

(in the source code of cli), identical(R, .Random.seed) is still TRUE; after running it, it's FALSE. More specifically, all we have to do at this point is evaluate the args argument (e.g. by typing args in the debugger): R evaluates function arguments lazily, so this is the moment the expression that builds args actually runs.

Working our way back up the chain and trying things by hand, we can see that manually evaluating

glue_cmd(text, .envir = .envir)

at this point in the code changes the random info.
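The check used throughout this walk-through (snapshot .Random.seed, run something, compare) can be wrapped in a small helper. This is only a minimal sketch in base R; changes_rng() is a hypothetical name of my own, not something provided by readr or cli, and it resets the seed as a side effect.

```r
# Hypothetical helper (not part of readr or cli): reports whether
# evaluating `expr` alters the RNG state. Note that it resets the seed.
changes_rng <- function(expr) {
  set.seed(123)                      # guarantees that .Random.seed exists
  before <- .Random.seed             # snapshot of the RNG state
  force(expr)                        # arguments are lazy; evaluate it now
  !identical(before, .Random.seed)   # TRUE if the state has moved
}

changes_rng(sum(1:10))    # deterministic code leaves the state alone
changes_rng(sample(10))   # drawing random numbers advances the state
```

Any expression can be passed in the same way, for example a read_csv() call on a file of your own, to see whether it touches the RNG in your installed versions.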

Still more stepping through takes us into glue_cmd, which calls make_cmd_transformer, which in turn calls a function named random_id():

values$marker <- random_id()

random_id() then calls sample(), which draws from R's RNG and therefore advances .Random.seed.
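To illustrate why a helper like this disturbs a seeded session, here is a minimal sketch. toy_random_id() is a hypothetical stand-in that I made up for illustration; it is not cli's actual implementation and only assumes, as described above, that the ID is built with sample().

```r
# Hypothetical stand-in for an ID generator built on sample();
# this is NOT cli's real code.
toy_random_id <- function(n = 8) {
  paste(sample(c(letters, 0:9), n, replace = TRUE), collapse = "")
}

set.seed(123)
x_clean <- rnorm(1)      # value drawn straight after seeding

set.seed(123)
id <- toy_random_id()    # consumes random numbers from the same stream
x_after <- rnorm(1)      # so this draw starts from a later RNG state

identical(x_clean, x_after)
```

The two rnorm(1) results should differ, which is the same pattern as in the question: the ID is generated between set.seed() and the next draw.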

I have no idea why this internal bit of cli needs to be generating a random string, but I guess you could ask the maintainers?

This was done using readr 1.4.0 and cli 2.5.0 (although the code references are to the current version on GitHub [10 June 2021]).
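If this side effect matters for reproducibility, one possible workaround (besides passing col_types) is to save and restore the RNG state around the call. The following is a minimal sketch in base R; with_saved_seed() is a hypothetical helper name, and sample() merely stands in for any call that hits the RNG. The withr package offers with_preserve_seed() for the same purpose.

```r
# Hypothetical helper: evaluates `expr`, then puts the RNG state back
# to what it was before the call.
with_saved_seed <- function(expr) {
  if (!exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
    set.seed(NULL)   # initialise the RNG if it has not been used yet
  }
  old <- get(".Random.seed", envir = globalenv())
  on.exit(assign(".Random.seed", old, envir = globalenv()), add = TRUE)
  expr               # the lazy argument is evaluated here
}

set.seed(123)
expected <- rnorm(1)

set.seed(123)
noise <- with_saved_seed(sample(100, 5))   # stands in for an RNG-using call
actual <- rnorm(1)

identical(expected, actual)
```

Because the state is restored when the helper exits, the draw after it matches the draw made straight after set.seed().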

answered Sep 22 '22 08:09

Ben Bolker
