I am reading from a huge text file that has '%d/%m/%Y'
date format. I want to use read.csv.sql of sqldf to read and filter the data by date at the same time. This is to save memory usage and run time by skipping many dates that I am not interested in. I know how to do this with the help of dplyr
and lubridate
, but I just want to try with sqldf
for the aforementioned reason. Even though I am quite familiar with SQL syntax, it still gets me most of the time, no exception with sqldf
.
Running command like following returned a data.frame with 0 row:
first_date <- "2001-11-1"
second_date <- "2003-11-1"
query <- "select * from file WHERE strftime('%d/%m/%Y', Date, 'unixepoch', 'localtime') between
'$first_date' AND '$second_date'"
df <- read.csv.sql(data_file,
sql= query,
stringsAsFactors=FALSE,
sep = ";", header = TRUE)
So for simulation, I tried with sqldf
function like the following:
first_date <- "2001-11-1"
second_date <- "2003-11-1"
df2 <- data.frame( Date = paste(rep(1:3, each = 4), 11:12, 2001:2012, sep = "/"))
sqldf("SELECT * FROM df2 WHERE strftime('%d/%m/%Y', Date, 'unixepoch') BETWEEN '$first-date' AND '$second_date' ")
# Expect:
# Date
# 1 1-11-2001
# 2 1-12-2002
# 3 1-11-2003
strftime strftime
with percent codes is used to convert an object already regarded by sqlite as a datetime to something else but you want the reverse so the approach in the question is not going to work. For example, here we convert the current time into a dd-mm-yyyy string:
library(sqldf)
sqldf("select strftime('%d-%m-%Y', 'now') now")
## now
## 1 07-09-2014
Discussion Since SQlite lacks date types its a bit onerous to handle this, particularly with the 1-or-2-digit non-standard date formats, but if you really want to use SQLite we can do it by tediously parsing out the date strings. Using fn$
from the gsubfn package for string interpolation eases this a little.
Code Below zero2d
outputs SQL code to prepend a zero character to its input if its one digit. rmSlash
outputs SQL code to remove any slashes in its argument. Year
, Month
and Day
each output SQL code to take a character string representing a date in the format under discussion and extract the indicated component reformatting it as a 2 digit zero filled character string in the case of Month
and Day
. fmtDate
takes a character string of the form shown in the question for first_string
and second_string
and outputs a yyyy-mm-dd
character string.
library(sqldf)
library(gsubfn)
zero2d <- function(x) sprintf("substr('0' || %s, -2)", x)
rmSlash <- function(x) sprintf("replace(%s, '/', '')", x)
Year <- function(x) sprintf("substr(%s, -4)", x)
Month <- function(x) {
y <- sprintf("substr(%s, instr(%s, '/') + 1, 2)", x, x)
zero2d(rmSlash(y))
}
Day <- function(x) {
y <- sprintf("substr(%s, 1, 2)", x)
zero2d(rmSlash(y))
}
fmtDate <- function(x) format(as.Date(x))
sql <- "select * from df2 where
`Year('Date')` || '-' ||
`Month('Date')` || '-' ||
`Day('Date')`
between '`fmtDate(first_date)`' and '`fmtDate(second_date)`'"
fn$sqldf(sql)
giving:
Date
1 1/11/2001
2 1/12/2002
3 1/11/2003
Notes
1) SQLite functions used instr
, replace
and substr
are core sqlite functions
2) SQL The actual SQL statement that is executed after fn$
performs the substitutions is as follows (slightly reformatted to fit):
> cat( fn$identity(sql), "\n")
select * from df2 where
substr(Date, -4)
|| '-' ||
substr('0' || replace(substr(Date, instr(Date, '/') + 1, 2), '/', ''), -2)
|| '-' ||
substr('0' || replace(substr(Date, 1, 2), '/', ''), -2)
between '2001-11-01' and '2003-11-01'
3) source of complications the main complication is the non-standard 1-or-2 digit day and month. Had they been consistently 2 digits it would have reduced to this:
first_date <- "2001-11-01"
second_date <- ""2003-11-01"
fn$sqldf("select Date from df2
where substr(Date, -4) || '-' ||
substr(Date, 4, 2) || '-' ||
substr(Date, 1, 2)
between '`first_date`' and '`second_date`' ")
4) H2 Here is an H2 solution. H2 does have a datetime type simplifying the solution substantially over SQLite. We assume that the data is in a file called mydata.dat
. Note that read.csv.sql
does not support H2 as H2 already has the internal csvread
SQL function to do that:
library(RH2)
library(sqldf)
first_date <- "2001-11-01"
second_date <- "2003-11-01"
fn$sqldf(c("CREATE TABLE t(DATE TIMESTAMP) AS
SELECT parsedatetime(DATE, 'd/M/y') as DATE
FROM CSVREAD('mydata.dat')",
"SELECT DATE FROM t WHERE DATE between '`first_date`' and '`second_date`'"))
Note that the first RH2 query will be slow in a session since it loads java. After that you can try it out to see if the performance is adequate.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With