Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: data.table. How to save dates properly with fwrite?

I have a dataset. I can choose to load it on R from a Stata file or from a SPSS file.
In both cases it's loaded properly with the haven package. The dates are recognized properly.

But when I save it to disk with data.table's fwrite function.
fwrite(ppp, "ppp.csv", sep=",", col.names = TRUE)

I have a problem, the dates dissapear and are converted to different numbers. For example, the date 1967-08-06 is saved in the csv file as -879

I've also tried playing with fwrite options, such as quote=FALSE, with no success.

I've uploaded a small sample of the files, the spss, the stata and the saved csv.

and this is the code, in order to do things easier for you.

library(haven)
library(data.table)
ppp <- read_sav("pspss.sav") # choose one of these two.
ppp <- read_dta("pstata.dta")  # choose one of these two.
fwrite(ppp, "ppp.csv",  sep=",", col.names = TRUE) 

The real whole table has more than one thousand variables and one million individuals. That's why I would like to use a fast way to do things.

http://www73.zippyshare.com/v/OwzwbyQq/file.html

This is for @ArtificialBreeze:

> head(my)
# A tibble: 6 x 9
  ID_2006_2011              TIS FECHA_NAC_2006    año2006 Edad_31_12_2006 SEXO_2006
         <dbl>            <chr>         <date>     <date>           <dbl>     <chr>
1  1.60701e+11 BBNR670806504015     1967-08-06 2006-12-31              39         M
2  1.60701e+11 BCBD580954916014     1958-09-14 2006-12-31              48         F
3  1.60701e+11 BCBL451245916015     1945-12-05 2006-12-31              61         F
4  1.60701e+11 BCGR610904916012     1961-09-04 2006-12-31              45         M
5  1.60701e+11 BCMR580148916015     1958-01-08 2006-12-31              48         F
6  1.60701e+11 BCMX530356917018     1953-03-16 2006-12-31              53         F
# ... with 3 more variables: PAIS_NAC_2006 <dbl>, FECHA_ALTA_TIS_2006 <date>,
#   FECHA_ALTA_TIS_2006n <date>
like image 646
skan Avatar asked Jul 14 '16 16:07

skan


2 Answers

Since this question was asked 6 months ago, fwrite has improved and been released to CRAN. I believe it should work as you wanted now; i.e. fast, direct and convenient date formatting. It now has the dateTimeAs argument as follows, copied from fwrite's manual page for v1.10.0 as on CRAN now. As time progresses, please check the latest version of the manual page.

====

dateTimeAs : How Date/IDate, ITime and POSIXct items are written.

  • "ISO" (default) - 2016-09-12, 18:12:16 and 2016-09-12T18:12:16.999999Z. 0, 3 or 6 digits of fractional seconds are printed if and when present for convenience, regardless of any R options such as digits.secs. The idea being that if milli and microseconds are present then you most likely want to retain them. R's internal UTC representation is written faithfully to encourage ISO standards, stymie timezone ambiguity and for speed. An option to consider is to start R in the UTC timezone simply with "$ TZ='UTC' R" at the shell (NB: it must be one or more spaces between TZ='UTC' and R, anything else will be silently ignored; this TZ setting applies just to that R process) or Sys.setenv(TZ='UTC') at the R prompt and then continue as if UTC were local time.

  • "squash" - 20160912, 181216 and 20160912181216999. This option allows fast and simple extraction of yyyy, mm, dd and (most commonly to group by) yyyymm parts using integer div and mod operations. In R for example, one line helper functions could use %/%10000, %/%100%%100, %%100 and %/%100 respectively. POSIXct UTC is squashed to 17 digits (including 3 digits of milliseconds always, even if 000) which may be read comfortably as integer64 (automatically by fread()).

  • "epoch" - 17056, 65536 and 1473703936.999999. The underlying number of days or seconds since the relevant epoch (1970-01-01, 00:00:00 and 1970-01-01T00:00:00Z respectively), negative before that (see ?Date). 0, 3 or 6 digits of fractional seconds are printed if and when present.

  • "write.csv" - this currently affects POSIXct only. It is written as write.csv does by using the as.character method which heeds digits.secs and converts from R's internal UTC representation back to local time (or the "tzone" attribute) as of that historical date. Accordingly this can be slow. All other column types (including Date, IDate and ITime which are independent of timezone) are written as the "ISO" option using fast C code which is already consistent with write.csv.

The first three options are fast due to new specialized C code. The epoch to date-part conversion uses a fast approach by Howard Hinnant (see references) using a day-of-year starting on 1 March. You should not be able to notice any difference in write speed between those three options. The date range supported for Date and IDate is [0000-03-01, 9999-12-31]. Every one of these 3,652,365 dates have been tested and compared to base R including all 2,790 leap days in this range. This option applies to vectors of date/time in list column cells, too. A fully flexible format string (such as "%m/%d/%Y") is not supported. This is to encourage use of ISO standards and because that flexibility is not known how to make fast at C level. We may be able to support one or two more specific options if required.

====

like image 101
Matt Dowle Avatar answered Oct 10 '22 07:10

Matt Dowle


I had the same problem, and I just changed the date column to as.character before writing, and then changed it back to as.Date after reading. I don't know how it influences read and write times, but it was a good enough solution for me.

like image 37
Krimonk Avatar answered Oct 10 '22 06:10

Krimonk