Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to detect TIME when reading from an excel sheet using R

Tags:

r

openxlsx

The issue is that when I read from an excel sheet into R using read.xlsx from openxlsx package, the TIME column is converted into a fraction.

Here is an example,

dfin <-
DATE          TIME
15/02/2015    8:00 AM
22/01/2014    10:00 PM

library(openxlsx)
test <-  read.xlsx("dfin.xlsx", sheet = 1,
                 detectDates=TRUE, skipEmptyRows = TRUE,
                 skipEmptyCols = TRUE, rows = NULL, cols = NULL, check.names = FALSE,
                 namedRegion = NULL, na.strings = "NA", fillMergedCells = FALSE) 

Output:

  DATE        TIME
  2015-02-15  0.3333333
  2014-01-22  0.9166667

I am not sure why it does that and whether there is a way to fix that as I need to use both DATE and TIME to do some calculations.

like image 657
Amer Avatar asked Feb 01 '19 06:02

Amer


2 Answers

R doesn't have a time format really, so I suggest reading it in using read_excel, which automatically detects the column type. This will turn it into a date-time format with a random date, which you can then remove, before converting it into a proper timestamp.

library(readxl)
library(lubridate)

test <- read_excel('dfin.xlsx',trim_ws = TRUE) %>%
  #return the TIME column to the way it is written in Excel
  mutate(TIME = as.character(gsub(".* ","",TIME)),
  #format the date column
     DATE = dmy(DATE),
  #turn it into a timestamp
     TIMESTAMP = as.POSIXct(paste(DATE,TIME)))
like image 56
Randall Helms Avatar answered Sep 19 '22 11:09

Randall Helms


My first guess was that read.xlsx() is trying to guess the date-looking columns in .xlsx while it reads the file, and weirdly converts the time from %I:%M %p format into fractions of 24 hours (because e.g. 0.3333333 * 24 = 7.999999 which is exactly 8.0). But latter I noticed that if I change the parameter detectDates into FALSE nothing really changes - it outputs the same data frame. So it guess nothing, it just reads the TIME as it is.

If you try to edit 10:00 PM within Excel workbook, you'll see that it is really stored as 22:00:00. So why at the end it is represented as a fraction of 24?! I don't know, and I hope someone can explain that.

@Randall approach is really good alternative comparing to openxlsx::read.xlsx(). Note that read_xlsx() recognizes TIME as %H:%M:%S, and converts it into the dummy POSIXct/POSIXt object, i.e. 1899-12-31 08:00:00 and 1899-12-31 22:00:00.

Surprisingly, read_xlsx() doesn't recognize that DATE has %d-%m-%Y format, and interpret it as a character. Meaning that we need to convert both variables into appropriate format in order to obtain desired output.

I don't think we need to use gsub to get the 12-hour clock time from POSIXct object, it is much easire to use format for this purpose. And conversion of DATE from %d-%m-%Y into %Y-%m-%d format is even an easier task:

library(dplyr)
library(readxl)

read_xlsx("myfile.xlsx") |>
  mutate(
    DATE = as.Date(DATE, "%d/%m/%Y"), 
    TIME = format(TIME, "%I:%M %p")   # “That’s what I do: I drink and I know things.”
 )

Which produces:

# A tibble: 2 x 2
  DATE       TIME    
  <date>     <chr>   
1 2015-02-15 08:00 AM
2 2014-01-22 10:00 PM
like image 41
utubun Avatar answered Sep 19 '22 11:09

utubun