I just noticed that read_csv() somehow uses random numbers, which is unexpected (at least to me). The corresponding base R function read.csv() does not do that. So, what does read_csv() use the random numbers for? I looked into the documentation but could not find a clear answer. Are the random numbers related to the guess_max argument?
library(tidyverse)
set.seed(123)
rnorm(1)
# [1] -0.5604756
set.seed(123)
dat <- read.csv("data/titanic.csv")
rnorm(1)
# [1] -0.5604756
set.seed(123)
dat <- read_csv("data/titanic.csv")
rnorm(1)
#[1] 1.239496
EDIT: I explicitly specified col_types and indeed it worked. But I still wonder why this is happening. Does anyone have an explanation?
set.seed(123)
dat <- read_csv("data/titanic.csv", col_types = c("dddccdddcdcc"))
rnorm(1)
#[1] -0.5604756
Regarding the readr version, here is my session info.
library(readr)
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 4.0.5 (2021-03-31)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate German_Germany.1252
#> ctype German_Germany.1252
#> tz Europe/Berlin
#> date 2021-06-10
#>
#> - Packages -------------------------------------------------------------------
#> package * version date lib source
#> cli 2.5.0 2021-04-26 [1] CRAN (R 4.0.3)
#> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.4)
#> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.3)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.0.3)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.3)
#> fansi 0.5.0 2021-05-25 [1] CRAN (R 4.0.5)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.0.5)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.3)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.3)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.0.5)
#> hms 1.0.0 2021-01-13 [1] CRAN (R 4.0.5)
#> htmltools 0.5.1.9003 2021-05-07 [1] Github (rstudio/htmltools@e12171e)
#> knitr 1.33 2021-04-24 [1] CRAN (R 4.0.5)
#> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.4)
#> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.3)
#> pillar 1.6.1 2021-05-16 [1] CRAN (R 4.0.5)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.3)
#> ps 1.6.0 2021-02-28 [1] CRAN (R 4.0.5)
#> R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.3)
#> readr * 1.4.0 2020-10-05 [1] CRAN (R 4.0.5)
#> reprex 2.0.0 2021-04-02 [1] CRAN (R 4.0.5)
#> rlang 0.4.11.9000 2021-05-29 [1] Github (r-lib/rlang@7797cdf)
#> rmarkdown 2.8.1 2021-05-07 [1] Github (rstudio/rmarkdown@e98207f)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.3)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.3)
#> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.3)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.3)
#> tibble 3.1.2 2021-05-16 [1] CRAN (R 4.0.5)
#> utf8 1.2.1 2021-03-12 [1] CRAN (R 4.0.3)
#> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.0.3)
#> withr 2.4.2 2021-04-18 [1] CRAN (R 4.0.5)
#> xfun 0.22 2021-03-11 [1] CRAN (R 4.0.5)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.3)
#>
#> [1] C:/Users/Albert/Documents/R/win-library/4.0
#> [2] C:/Program Files/R/R-4.0.5/library
Created on 2021-06-10 by the reprex package (v2.0.0)
tl;dr somewhere deep in the guts of the cli package (called to generate the pretty-printed output about column types), the code is generating a random string to use as a label.
A major clue is that
set.seed(123); dat <- read_csv("iris.csv", col_types=cols()); rnorm(1)
runs read_csv, guessing the column types but without printing information about the guesses; this doesn't hit the RNG, which makes me think it's something in the fancy colour printing.
By making a copy of the random seed info (R <- .Random.seed), stepping through the code (debug(readr::show_cols_spec)), and periodically running identical(R, .Random.seed) to check on the status, I found that the random info changes after running
cli::cli_h1("Column specification")
Debugging into that function, the change occurs somewhere in cli::cli__message; specifically, right before we execute this line
if ("id" %in% names(args) && is.null(args$id)) args$id <- new_uuid()
(which is in the source code of cli), identical(R, .Random.seed) is still TRUE; after running it, it's FALSE. More specifically, all we have to do at this point is evaluate the args argument (e.g. by typing args in the debugger).
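The seed-comparison check described above can be wrapped in a small helper. This is only a sketch: touches_rng is a hypothetical name, not part of readr or cli, and it resets the seed as a side effect.

```r
# Hypothetical helper (not part of readr or cli): returns TRUE if evaluating
# `expr` changes the global RNG state. Note: it calls set.seed() itself.
touches_rng <- function(expr) {
  set.seed(123)                                   # guarantees .Random.seed exists
  before <- get(".Random.seed", envir = globalenv())
  force(expr)                                     # evaluate the lazy argument now
  after <- get(".Random.seed", envir = globalenv())
  !identical(before, after)
}

touches_rng(1 + 1)       # FALSE: plain arithmetic draws no random numbers
touches_rng(rnorm(1))    # TRUE: rnorm() advances the generator
```

Wrapping a call such as read_csv(...) in the same way should show whether that particular call draws random numbers.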
Working our way back up the chain and trying things by hand, we can see that manually evaluating
glue_cmd(text, .envir = .envir)
at this point in the code changes the random info.
Still more stepping through takes us to a point within glue_cmd where we call make_cmd_transformer, which in turn calls a function called random_id():
values$marker <- random_id()
random_id() then calls sample() ...
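To see why a single call to sample() is enough to change later results, here is a minimal sketch. make_id is a hypothetical stand-in and is not cli's actual random_id() implementation; it only assumes that the id is built with sample().

```r
# Hypothetical stand-in for an id generator built on sample();
# this is NOT the actual code of cli's random_id().
make_id <- function(n = 8) {
  paste(sample(c(letters, 0:9), n, replace = TRUE), collapse = "")
}

set.seed(123)
x1 <- rnorm(1)        # first draw straight after seeding

set.seed(123)
id <- make_id()       # consumes random numbers before rnorm() runs
x2 <- rnorm(1)

identical(x1, x2)     # FALSE: the id generation shifted the random stream
```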
I have no idea why this internal bit of cli needs to be generating a random string, but I guess you could ask the maintainers?
This was done using readr 1.4.0 and cli 2.5.0 (although the code references are to the current version on GitHub [10 June 2021]).
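If the goal is simply to keep results reproducible regardless of what a function does internally, one option is to save and restore .Random.seed around the call. Below is a minimal sketch, assuming a seed has already been set; preserve_seed and noisy_print are hypothetical names used only for illustration (the withr package provides with_preserve_seed() for the same purpose).

```r
# Hypothetical wrapper: evaluates `expr`, then puts the RNG state back.
# Assumes .Random.seed already exists (e.g. set.seed() was called).
preserve_seed <- function(expr) {
  old <- get(".Random.seed", envir = globalenv())
  on.exit(assign(".Random.seed", old, envir = globalenv()))
  expr                                   # forces the lazy argument
}

# Hypothetical function that draws random numbers as a side effect.
noisy_print <- function() {
  invisible(sample(10))
}

set.seed(123)
preserve_seed(noisy_print())             # RNG state is restored afterwards
y <- rnorm(1)

set.seed(123)
identical(y, rnorm(1))                   # TRUE: the side effect left no trace
```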