The issue is that when I read from an excel sheet into R using read.xlsx
from openxlsx
package, the TIME
column is converted into a fraction.
Here is an example,
dfin <-
DATE TIME
15/02/2015 8:00 AM
22/01/2014 10:00 PM
library(openxlsx)
test <- read.xlsx("dfin.xlsx", sheet = 1,
detectDates=TRUE, skipEmptyRows = TRUE,
skipEmptyCols = TRUE, rows = NULL, cols = NULL, check.names = FALSE,
namedRegion = NULL, na.strings = "NA", fillMergedCells = FALSE)
Output:
DATE TIME
2015-02-15 0.3333333
2014-01-22 0.9166667
I am not sure why it does that and whether there is a way to fix that as I need to use both DATE and TIME to do some calculations.
R doesn't have a time format really, so I suggest reading it in using read_excel
, which automatically detects the column type. This will turn it into a date-time format with a random date, which you can then remove, before converting it into a proper timestamp.
library(readxl)
library(lubridate)
test <- read_excel('dfin.xlsx',trim_ws = TRUE) %>%
#return the TIME column to the way it is written in Excel
mutate(TIME = as.character(gsub(".* ","",TIME)),
#format the date column
DATE = dmy(DATE),
#turn it into a timestamp
TIMESTAMP = as.POSIXct(paste(DATE,TIME)))
My first guess was that read.xlsx()
is trying to guess the date-looking columns in .xlsx while it reads the file, and weirdly converts the time from %I:%M %p
format into fractions of 24 hours (because e.g. 0.3333333 * 24 = 7.999999
which is exactly 8.0). But latter I noticed that if I change the parameter detectDates
into FALSE
nothing really changes - it outputs the same data frame. So it guess nothing, it just reads the TIME
as it is.
If you try to edit 10:00 PM
within Excel workbook, you'll see that it is really stored as 22:00:00
. So why at the end it is represented as a fraction of 24
?! I don't know, and I hope someone can explain that.
@Randall approach is really good alternative comparing to openxlsx::read.xlsx()
. Note that read_xlsx()
recognizes TIME
as %H:%M:%S
, and converts it into the dummy POSIXct/POSIXt
object, i.e. 1899-12-31 08:00:00
and 1899-12-31 22:00:00
.
Surprisingly, read_xlsx()
doesn't recognize that DATE
has %d-%m-%Y
format, and interpret it as a character
. Meaning that we need to convert both variables into appropriate format in order to obtain desired output.
I don't think we need to use gsub
to get the 12-hour clock time from POSIXct
object, it is much easire to use format
for this purpose. And conversion of DATE
from %d-%m-%Y
into %Y-%m-%d
format is even an easier task:
library(dplyr)
library(readxl)
read_xlsx("myfile.xlsx") |>
mutate(
DATE = as.Date(DATE, "%d/%m/%Y"),
TIME = format(TIME, "%I:%M %p") # “That’s what I do: I drink and I know things.”
)
Which produces:
# A tibble: 2 x 2
DATE TIME
<date> <chr>
1 2015-02-15 08:00 AM
2 2014-01-22 10:00 PM
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With