Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R data.table fread - read column as Date

Tags:

r

data.table

I would like to read a file with fread from data.table that has a column with "YYYY-MM-DD" format dates. By default, fread reads the column as chr. However, I would like to have the column as Date, the same way I would obtain when applying as.Date.

I have tried to use

dt[,starttime.date := as.Date(starttime.date)]

but it takes very long to run (I have approx. 43 million rows).

like image 754
paljenczy Avatar asked Mar 19 '15 09:03

paljenczy


People also ask

Is fread faster than read CSV?

For files beyond 100 MB in size fread() and read_csv() can be expected to be around 5 times faster than read. csv() .

How do I convert a date in R?

You can use the as. Date( ) function to convert character data to dates. The format is as. Date(x, "format"), where x is the character data and format gives the appropriate format.

How does fread work in R?

The fread() function returns the number of full items successfully read, which can be less than count if an error occurs, or if the end-of-file is met before reaching count. If size or count is 0, the fread() function returns zero, and the contents of the array and the state of the stream remain unchanged.

What package is data table in R?

Data. table is an extension of data. frame package in R. It is widely used for fast aggregation of large datasets, low latency add/update/remove of columns, quicker ordered joins, and a fast file reader.


1 Answers

Using the fasttime package, as suggested in the fread documentation, is approximately 100x faster than as.Date or as.IDate:

library(data.table)
library(fasttime)

dt[,starttime.date := fastPOSIXct(starttime.date)]

Benchmark results:

library(microbenchmark)
library(fasttime)
DT <- data.table(start_date = paste(sample(1900:2018, 100000, replace = T), 
                                    sample(1:12, 100000, replace = T),
                                    sample(1:28, 100000, replace = T),
                                    sep = "-"))
microbenchmark(
  as.Date(DT$start_date),
  as.IDate(DT$start_date),
  fastPOSIXct(DT$start_date)
)

> Unit: milliseconds
>                        expr    mean 
>      as.Date(DT$start_date)  383.89
>     as.IDate(DT$start_date)  405.89
>  fastPOSIXct(DT$start_date)    4.59 
like image 76
Mark Egge Avatar answered Oct 02 '22 10:10

Mark Egge