 

What does read_csv() use random numbers for?

Tags: r, readr

I just noticed that read_csv() somehow uses random numbers, which is unexpected (at least to me). The corresponding base R function read.csv() does not do that. So what does read_csv() use the random numbers for? I looked into the documentation but could not find a clear answer. Are the random numbers related to the guess_max argument?

library(tidyverse)
set.seed(123)
rnorm(1)
# [1] -0.5604756

set.seed(123)
dat <- read.csv("data/titanic.csv")
rnorm(1)
# [1] -0.5604756

set.seed(123)
dat <- read_csv("data/titanic.csv")
rnorm(1)
# [1] 1.239496

EDIT:

  1. As suggested by rawr's comment, I tried specifying col_types and indeed it worked. But still I wonder why this is happening. Anyone got an explanation?
set.seed(123)
dat <- read_csv("data/titanic.csv", col_types = c("dddccdddcdcc"))
rnorm(1)
# [1] -0.5604756
  2. Since a lot of people asked about the R and readr version, here is my session info.
library(readr)
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 4.0.5 (2021-03-31)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  German_Germany.1252         
#>  ctype    German_Germany.1252         
#>  tz       Europe/Berlin               
#>  date     2021-06-10                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version     date       lib source                            
#>  cli           2.5.0       2021-04-26 [1] CRAN (R 4.0.3)                    
#>  crayon        1.4.1       2021-02-08 [1] CRAN (R 4.0.4)                    
#>  digest        0.6.27      2020-10-24 [1] CRAN (R 4.0.3)                    
#>  ellipsis      0.3.2       2021-04-29 [1] CRAN (R 4.0.3)                    
#>  evaluate      0.14        2019-05-28 [1] CRAN (R 4.0.3)                    
#>  fansi         0.5.0       2021-05-25 [1] CRAN (R 4.0.5)                    
#>  fastmap       1.1.0       2021-01-25 [1] CRAN (R 4.0.5)                    
#>  fs            1.5.0       2020-07-31 [1] CRAN (R 4.0.3)                    
#>  glue          1.4.2       2020-08-27 [1] CRAN (R 4.0.3)                    
#>  highr         0.9         2021-04-16 [1] CRAN (R 4.0.5)                    
#>  hms           1.0.0       2021-01-13 [1] CRAN (R 4.0.5)                    
#>  htmltools     0.5.1.9003  2021-05-07 [1] Github (rstudio/htmltools@e12171e)
#>  knitr         1.33        2021-04-24 [1] CRAN (R 4.0.5)                    
#>  lifecycle     1.0.0       2021-02-15 [1] CRAN (R 4.0.4)                    
#>  magrittr      2.0.1       2020-11-17 [1] CRAN (R 4.0.3)                    
#>  pillar        1.6.1       2021-05-16 [1] CRAN (R 4.0.5)                    
#>  pkgconfig     2.0.3       2019-09-22 [1] CRAN (R 4.0.3)                    
#>  ps            1.6.0       2021-02-28 [1] CRAN (R 4.0.5)                    
#>  R6            2.5.0       2020-10-28 [1] CRAN (R 4.0.3)                    
#>  readr       * 1.4.0       2020-10-05 [1] CRAN (R 4.0.5)                    
#>  reprex        2.0.0       2021-04-02 [1] CRAN (R 4.0.5)                    
#>  rlang         0.4.11.9000 2021-05-29 [1] Github (r-lib/rlang@7797cdf)      
#>  rmarkdown     2.8.1       2021-05-07 [1] Github (rstudio/rmarkdown@e98207f)
#>  rstudioapi    0.13        2020-11-12 [1] CRAN (R 4.0.3)                    
#>  sessioninfo   1.1.1       2018-11-05 [1] CRAN (R 4.0.3)                    
#>  stringi       1.5.3       2020-09-09 [1] CRAN (R 4.0.3)                    
#>  stringr       1.4.0       2019-02-10 [1] CRAN (R 4.0.3)                    
#>  tibble        3.1.2       2021-05-16 [1] CRAN (R 4.0.5)                    
#>  utf8          1.2.1       2021-03-12 [1] CRAN (R 4.0.3)                    
#>  vctrs         0.3.8       2021-04-29 [1] CRAN (R 4.0.3)                    
#>  withr         2.4.2       2021-04-18 [1] CRAN (R 4.0.5)                    
#>  xfun          0.22        2021-03-11 [1] CRAN (R 4.0.5)                    
#>  yaml          2.2.1       2020-02-01 [1] CRAN (R 4.0.3)                    
#> 
#> [1] C:/Users/Albert/Documents/R/win-library/4.0
#> [2] C:/Program Files/R/R-4.0.5/library

Created on 2021-06-10 by the reprex package (v2.0.0)

asked Jun 09 '21 17:06 by AlbertRapp




1 Answer

tl;dr somewhere deep in the guts of the cli package (called to generate the pretty-printed output about column types), the code is generating a random string to use as a label.


A major clue is that

set.seed(123); dat <- read_csv("iris.csv", col_types=cols()); rnorm(1)

runs read_csv guessing the column types but without printing information about the guesses; this doesn't hit the RNG, which makes me think it's something in the fancy colour printing.

By making a copy of the random seed info (R <- .Random.seed) and stepping through the code (debug(readr::show_cols_spec)) and periodically running identical(R, .Random.seed) to check on the status, I found that the random info changes after running

cli::cli_h1("Column specification")

Debugging into that function, the change occurs somewhere in cli::cli__message; specifically, right before we execute this line

 if ("id" %in% names(args) && is.null(args$id)) args$id <- new_uuid()

(which is in the source code of cli), identical(R, .Random.seed) is still TRUE; after running it, it's FALSE. More precisely, simply evaluating the args argument at this point (e.g. by typing args in the debugger) is enough to change the random state.

Working our way back up the chain and trying things by hand, we can see that manually evaluating

glue_cmd(text, .envir = .envir)

at this point in the code changes the random info.
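The identical(R, .Random.seed) check used throughout this hunt can be wrapped in a small helper, which saves retyping it at every step. This is only a sketch: changes_rng() is a hypothetical name, not a function from readr or cli, and it assumes the code under test uses R's global RNG.

```r
# Hypothetical helper (not part of readr or cli): reports whether evaluating
# `expr` changes the global RNG state stored in .Random.seed.
changes_rng <- function(expr) {
  # .Random.seed only exists once the RNG has been seeded or used,
  # so draw one number if it is missing.
  if (!exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
    invisible(runif(1))
  }
  before <- get(".Random.seed", envir = globalenv())
  force(expr)  # evaluate the (lazily passed) expression now
  after <- get(".Random.seed", envir = globalenv())
  !identical(before, after)
}

set.seed(123)
changes_rng(nchar("abc"))        # no random draws: FALSE
changes_rng(sample(letters, 5))  # draws random numbers: TRUE
```

In the same way, something like changes_rng(cli::cli_h1("Column specification")) could be used to test a single suspect call without stepping through the debugger.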

Stepping through further takes us to a point within glue_cmd where make_cmd_transformer is called, which in turn calls a function named random_id():

values$marker <- random_id()

random_id() then calls sample ...

I have no idea why this internal bit of cli needs to be generating a random string, but I guess you could ask the maintainers?
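To illustrate why a single call to sample() inside a label generator is enough to shift every later draw, here is a minimal sketch. fake_random_id() is a hypothetical stand-in written for this example; it is not cli's actual random_id() implementation, which I have not reproduced here.

```r
# Hypothetical stand-in for an internal id generator; NOT cli's actual code.
fake_random_id <- function() {
  paste(sample(c(letters, 0:9), 8, replace = TRUE), collapse = "")
}

set.seed(123)
x_plain <- rnorm(1)

set.seed(123)
id <- fake_random_id()  # consumes numbers from the global random stream
x_after_id <- rnorm(1)  # so this draw starts from a different state

identical(x_plain, x_after_id)  # FALSE
```

This mirrors the behaviour in the question: same seed, but a different rnorm(1) once something in between has drawn from the RNG.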


This was done using readr 1.4.0 and cli 2.5.0 (although the code references are to the current version on GitHub [10 June 2021]).
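If the only goal is to keep later random draws reproducible, one option besides specifying col_types is to save the RNG state before the call and restore it afterwards. The sketch below uses base R only; preserve_seed() is a hypothetical helper name, and sample(10) merely stands in for any call that touches the RNG. (The withr package offers with_preserve_seed() for the same purpose.)

```r
# Hypothetical helper: evaluates `expr`, then restores the RNG state
# that was in place before the evaluation.
preserve_seed <- function(expr) {
  if (!exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
    invisible(runif(1))  # make sure .Random.seed exists
  }
  old_seed <- get(".Random.seed", envir = globalenv())
  on.exit(assign(".Random.seed", old_seed, envir = globalenv()))
  force(expr)
}

set.seed(123)
a <- rnorm(1)

set.seed(123)
perm <- preserve_seed(sample(10))  # stand-in for e.g. read_csv("data/titanic.csv")
b <- rnorm(1)

identical(a, b)  # TRUE: the draw inside preserve_seed() left no trace
```

Note that this hides the side effect rather than removing it; the wrapped code still draws random numbers while it runs.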

answered Sep 22 '22 08:09 by Ben Bolker