The xlsx
package is reading dates in wrongly. I've read all the top similar Q's here and had a scout round the internet but I can't find this particular behaviour where the origin changes if there's non-date data in a column.
I have a tiny Excel spreadsheet you can get from dropbox:
https://www.dropbox.com/s/872q9mzb5uzukws/test.xlsx
It has three rows, two columns. First is a date, second is a number. The third row has "Grand Total" in the date column.
If I read in the first two rows with read.xlsx
and tell it the first column is a date then this works:
read.xlsx("./test.xlsx",head=FALSE,1,colClasses=c("Date","integer"),endRow=2)
X1 X2
1 2014-06-29 49
2 2014-06-30 46
Those are indeed the dates in the spreadsheet. If I try and read all three rows, something goes wrong:
read.xlsx("./test.xlsx",head=FALSE,1,colClasses=c("Date","integer"))
X1 X2
1 2084-06-30 49
2 2084-07-01 46
3 <NA> 89251
Warning message:
In as.POSIXlt.Date(x) : NAs introduced by coercion
If I try reading in as integers I get different integers:
> read.xlsx("./test.xlsx",head=FALSE,1,colClasses=c("integer","integer"),endRow=2)
X1 X2
1 16250 49
2 16251 46
> read.xlsx("./test.xlsx",head=FALSE,1,colClasses=c("integer","integer"))
X1 X2
1 41819 49
2 41820 46
3 NA 89251
The first integers are correctly converted using as.Date(s1$X1,origin="1970-01-01")
(Unix epoch) and the second integers are correctly converted using as.Date(s2$X1, origin="1899-12-30")
(Excel epoch). If I convert the second lot using 1970 I get the 2084 dates.
So: Am I doing something wrong? Is the best thing to read as integers, and if any NAs then convert using Excel epoch, otherwise use Unix epoch? Or is it a bug in the xlsx
package?
xlsx version is Version: 0.5.1
The dates can be read as integers and later converted to Date using openxlsx::convertToDate()
function.
More here
XLConnect
is able to handle this pretty sweet:
test <- readWorksheetFromFile( "~/Downloads/test.xlsx", sheet = "Sheet1", header = FALSE )
test
Col1 Col2
1 2014-06-29 00:00:00 49
2 2014-06-30 00:00:00 46
3 Grand Total 89251
The Problem you have is obvious, that the first column is of mixed type: character
and POSIXct
. XLConnect
is able to read each cell correctly in, but casts all cells of a column then to the most common type, which is character
in this case.
str(test)
'data.frame': 3 obs. of 2 variables:
$ Col1: chr "2014-06-29 00:00:00" "2014-06-30 00:00:00" "Grand Total"
$ Col2: num 49 46 89251
The problem you're having is that Excel stores the number of days since Jan-0-1900, and that is the number R is reading from the excel file. When you convert in R, you are converting based on the number of days since Jan-1-1970. If you subtract the number of days between those two first, it should work.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With