I am experimenting with R to analyse some measurement data. I have a .csv file containing more than 2 million lines of measurements. Here is an example:
2014-10-22 21:07:03+00:00,7432442.0
2014-10-22 21:07:21+00:00,7432443.0
2014-10-22 21:07:39+00:00,7432444.0
2014-10-22 21:07:57+00:00,7432445.0
2014-10-22 21:08:15+00:00,7432446.0
2014-10-22 21:08:33+00:00,7432447.0
2014-10-22 21:08:52+00:00,7432448.0
2014-10-22 21:09:10+00:00,7432449.0
2014-10-22 21:09:28+00:00,7432450.0
After reading in the file, I want to convert the timestamps to a proper date-time class using as.POSIXct(). For small files this works fine, but for large files it does not.
I made an example by reading in a big file, creating a copy of a small portion and then applying as.POSIXct() to the relevant column. I included an image of the file. As you can see, when applying it to the temp variable it correctly keeps the hours, minutes and seconds. However, when applying it to the whole file, only the date is stored. It also takes a lot of time (more than 2 minutes).
What could cause this anomaly? Is it due to some system limit, since I'm running this on my laptop?
Edit
On my Windows 7 device I run R 3.1.3 which results in this error. However, on Ubuntu 14.01, running R 3.0.2, the times are kept for the large files. Just noticed there is a newer version (3.2.0) for Windows, will update and check if the issue persists.
as.POSIXct() returns an object that stores both a date and a time, with an associated time zone. Unless you specify one, the time zone used is the one your computer is set to, which is most often your local time zone.
The POSIXct class stores date/time values as the number of seconds since January 1, 1970, while the POSIXlt class stores them as a list with elements for second, minute, hour, day, month, and year, among others.
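To make the difference between the two classes concrete, here is a minimal sketch. The timestamps are arbitrary illustrative values, and tz = "UTC" is set explicitly so the numbers do not depend on the local time zone.

```r
# Parse one timestamp in UTC so the result does not depend on the local time zone.
x <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")

# POSIXct: a single number, seconds since 1970-01-01 00:00:00 UTC.
secs <- as.numeric(x)   # 86400, i.e. exactly one day

# POSIXlt: a list-like object with named calendar components.
lt <- as.POSIXlt("2014-10-22 21:07:03", tz = "UTC")
lt$hour          # 21
lt$min           # 7
lt$sec           # 3
lt$year + 1900   # 2014 (years are stored as an offset from 1900)
lt$mon + 1       # 10 (months are stored as 0-11)
```

For a column with millions of rows, POSIXct is usually the more compact choice, since it keeps one number per value.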
For the format = argument, provide a character string (in quotes) that describes the current date format using the strptime conversion specifications. For example, if your character dates are currently in the format "DD/MM/YYYY", like "24/04/1968", then you would use format = "%d/%m/%Y" to convert the values into dates.
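A short sketch of passing format explicitly, first for the date-only example and then for a timestamp shaped like the rows in the question. Using tz = "UTC" is an assumption based on the "+00:00" suffix in the sample data. The format string below does not parse that suffix: strptime ignores trailing characters once the format has been consumed.

```r
# Date-only string in DD/MM/YYYY form.
d <- as.Date("24/04/1968", format = "%d/%m/%Y")
format(d)   # "1968-04-24"

# Timestamp shaped like the rows in the question; tz = "UTC" is an assumption
# based on the "+00:00" suffix, which this format string leaves unparsed.
ts <- as.POSIXct("2014-10-22 21:07:03+00:00",
                 format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
format(ts, "%Y-%m-%d %H:%M:%S")   # "2014-10-22 21:07:03"
```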
Maybe the reason for your problem is that you have dates without time somewhere in your data set. Try the following example:
library(lubridate)
dates <- as.character(now() + minutes(1:5))
dates <- c(dates,"2015-05-10")
as.POSIXct(dates[1:5])
as.POSIXct(dates)
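It first creates a vector dates containing 5 date-times and converts them to character. Then I add another date (as a character) that does not contain a time. When you run the two conversions to POSIXct, you'll notice that the times are gone in the result as soon as you include the date without a time. As far as I understand it, when no format is given, as.POSIXct() looks for one format that fits all the values, and the only one that fits a bare date is the date-only format.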
So there seems to be no date without a time in the first few rows of your data, but there may be some further down. There are most likely many solutions for this problem and I'll just propose one that came to my mind.
The first step is to change your read command, such that the dates are stored as characters instead of factors:
data <- read.csv("C:/RData/house2_electricity_Main.csv",header=FALSE,stringsAsFactors=FALSE)
Then you can try to add the time to all the dates that have none and convert to POSIXct only afterwards:
data$V1 <- ifelse(nchar(data$V1) > 11,data$V1, paste0(data$V1," 00:00:00"))
data$V1 <- as.POSIXct(data$V1)
This worked for my little example above. It is not the most elegant solution and maybe someone has a better idea.
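Another option, sketched here on a made-up three-element vector rather than the real file, is to pass an explicit format. Entries without a time then come back as NA instead of silently changing how every other row is parsed, so you can locate them with is.na(). The variable names and tz = "UTC" are assumptions for illustration.

```r
# Toy input: two full timestamps and one date-only entry (made-up values).
x <- c("2014-10-22 21:07:03+00:00",
       "2014-10-22 21:07:21+00:00",
       "2015-05-10")

# With an explicit format, each element is parsed against that format only:
# entries that do not match become NA and the others keep their times.
parsed <- as.POSIXct(x, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Positions of the entries that lack a time component.
bad <- which(is.na(parsed))   # 3

# Optionally treat those as midnight, mirroring the ifelse() approach.
parsed[bad] <- as.POSIXct(x[bad], format = "%Y-%m-%d", tz = "UTC")
format(parsed, "%Y-%m-%d %H:%M:%S")
```

On a file with millions of rows, giving the format up front may also be faster, since R does not have to try several candidate formats, but I have not measured that on your data.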