automatically detect date columns when reading a file into a data.frame

When reading a file, the read.table function uses type.convert to distinguish between logical, integer, numeric, complex, or factor columns and store them accordingly.
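As a quick illustration of that behavior (a minimal sketch using only base R; the sample vectors are made up for the demonstration), type.convert turns digit strings into integers but leaves date-like strings as character:

```r
# Sample inputs, invented for illustration
nums  <- c("10", "20", "30")
dates <- c("1/1/2013", "2/1/2013", "3/1/2013")

# as.is = TRUE keeps strings as character instead of converting them to factor
num_class  <- class(type.convert(nums, as.is = TRUE))
date_class <- class(type.convert(dates, as.is = TRUE))

num_class   # "integer"
date_class  # "character"
```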

I'd like to add dates to the mix, so that columns containing dates can automatically be recognized and parsed into Date objects. Only a few date formats should be recognized, e.g.

date.formats <- c("%m/%d/%Y", "%Y/%m/%d")

Here is an example:

fh <- textConnection(

 "num  char date-format1  date-format2  not-all-dates  not-same-formats
   10     a     1/1/2013    2013/01/01     2013/01/01          1/1/2013
   20     b     2/1/2013    2013/02/01              a        2013/02/01 
   30     c     3/1/2013            NA              b          3/1/2013"
)

And the output of

dat <- my.read.table(fh, header = TRUE, stringsAsFactors = FALSE,
                     date.formats = date.formats)
sapply(dat, class)

would give:

num              => numeric
char             => character
date-format1     => Date
date-format2     => Date
not-all-dates    => character
not-same-formats => character   # not a typo: date format must be consistent

Before I go and implement it from scratch, is something like this already available in a package? Or maybe someone already gave it a crack (or will) and is willing to share his code here? Thank you.

asked Aug 22 '13 20:08

flodel

2 Answers

You could use lubridate::parse_date_time, which is a bit stricter. It returns date-times (POSIXct) rather than Date objects, so the result is wrapped in as.Date().

I've also added a bit more checking for existing NA values (may not be necessary).

library(lubridate)

my.read.table <- function(..., date.formats = c("%m/%d/%Y", "%Y/%m/%d")) {
  dat <- read.table(...)
  for (col.idx in seq_len(ncol(dat))) {
    x <- dat[, col.idx]
    # Only character or factor columns are candidates for date parsing
    if (!(is.character(x) || is.factor(x))) next
    if (all(is.na(x))) next
    complete.x <- !is.na(x)
    for (format in date.formats) {
      d <- as.Date(parse_date_time(as.character(x), format, quiet = TRUE))
      # Skip this format if any originally non-NA value failed to parse
      if (any(is.na(d[complete.x]))) next
      dat[, col.idx] <- d
      break  # stop at the first format that fits the whole column
    }
  }
  dat
}

dat <- my.read.table(fh, stringsAsFactors = FALSE, header = TRUE)

str(dat)
'data.frame':   3 obs. of  6 variables:
 $ num             : int  10 20 30
 $ char            : chr  "a" "b" "c"
 $ date.format1    : Date, format: "2013-01-01" "2013-02-01" "2013-03-01"
 $ date.format2    : Date, format: "2013-01-01" "2013-02-01" NA
 $ not.all.dates   : chr  "2013/01/01" "a" "b"
 $ not.same.formats: chr  "1/1/2013" "2013/02/01" "3/1/2013"

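An alternative would be to set options(warn = 2) within the function, which turns warnings into errors, and wrap the parse_date_time(...) call in a try statement, so that a format is rejected as soon as any value fails to parse: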

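library(lubridate)

my.read.table <- function(..., date.formats = c("%m/%d/%Y", "%Y/%m/%d")) {
  dat <- read.table(...)
  # Promote warnings to errors so that a partial parse failure is caught by try()
  owarn <- getOption("warn")
  on.exit(options(warn = owarn))
  options(warn = 2)
  for (col.idx in seq_len(ncol(dat))) {
    x <- dat[, col.idx]
    # Only character or factor columns are candidates for date parsing
    if (!(is.character(x) || is.factor(x))) next
    if (all(is.na(x))) next
    for (format in date.formats) {
      d <- try(as.Date(parse_date_time(as.character(x), format)), silent = TRUE)
      if (inherits(d, "try-error")) next
      dat[, col.idx] <- d
      break  # stop at the first format that parses without warnings
    }
  }
  dat
}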
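
The mechanism behind this variant can be seen in isolation. The sketch below uses base as.integer as a stand-in for the date parser (an assumption made purely for illustration; strict_parse is a hypothetical helper name): with warn = 2 the coercion warning becomes an error, and try turns that error into a "try-error" object that can be tested with inherits.

```r
# Hypothetical helper: treat any warning raised during conversion as a failure
strict_parse <- function(x) {
  owarn <- options(warn = 2)   # warnings become errors; old setting is returned
  on.exit(options(owarn))      # restore the caller's setting on exit
  try(as.integer(x), silent = TRUE)
}

ok  <- strict_parse(c("1", "2"))   # converts cleanly
bad <- strict_parse(c("1", "a"))   # "a" would normally give NA with a warning

inherits(ok, "try-error")    # FALSE
inherits(bad, "try-error")   # TRUE
```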

answered Oct 18 '22 19:10

mnel

You can try with regular expressions.

my.read.table <- function(..., date.formats = c("%m/%d/%Y", "%Y/%m/%d")) {
   require(stringr)
   # Regular-expression equivalents of the supported conversion specifications
   formats <- c(
     "%m" = "[0-9]{1,2}",
     "%d" = "[0-9]{1,2}",
     "%Y" = "[0-9]{4}"
   )
   dat <- read.table(...)
   for (col.idx in seq_len(ncol(dat))) {
      x <- dat[, col.idx]
      # Only character or factor columns are candidates for date parsing
      if (!(is.character(x) || is.factor(x))) next
      if (all(is.na(x))) next
      x <- as.character(x)
      for (format in date.formats) {
         # Convert the format into an anchored regular expression,
         # keeping the original format string intact for as.Date()
         pattern <- format
         for (k in names(formats)) {
           pattern <- str_replace_all(pattern, fixed(k), formats[[k]])
         }
         pattern <- paste0("^", pattern, "$")
         # Check if it matches on the non-NA elements
         if (all(str_detect(x, pattern) | is.na(x))) {
           dat[, col.idx] <- as.Date(x, format)
           break
         }
      }
   }
   dat
}

dat <- my.read.table(fh, header = TRUE, stringsAsFactors = FALSE)
as.data.frame(sapply(dat, class))
#                  sapply(dat, class)
# num                         integer
# char                      character
# date.format1                   Date
# date.format2                   Date
# not.all.dates             character
# not.same.formats          character
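
One reason a pattern check like this is useful: as.Date on its own is lenient, because trailing characters in the input are ignored and short year fields are accepted. The following sketch (base R only; matches_format is a hypothetical helper name, and the two patterns are hand-written for the two formats above) shows a month/day/year value being parsed without NA under the year-first format, while an anchored pattern rejects it:

```r
x <- "1/1/2013"

# Parsed with the wrong (year-first) format: the result is not NA,
# but it is also not the intended date
wrong <- as.Date(x, format = "%Y/%m/%d")
is.na(wrong)
wrong == as.Date("2013-01-01")

# Hypothetical helper: test a value against an anchored pattern
matches_format <- function(x, pattern) grepl(pattern, x)

ymd_pattern <- "^[0-9]{4}/[0-9]{1,2}/[0-9]{1,2}$"
mdy_pattern <- "^[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}$"

matches_format(x, ymd_pattern)  # FALSE
matches_format(x, mdy_pattern)  # TRUE
```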

answered Oct 18 '22 18:10

Vincent Zoonekynd


Tags:

date

r

read.table
