Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

sqldf: query data by range of dates

Tags:

r

sqldf

I am reading from a huge text file that has '%d/%m/%Y' date format. I want to use read.csv.sql of sqldf to read and filter the data by date at the same time. This is to save memory usage and run time by skipping many dates that I am not interested in. I know how to do this with the help of dplyr and lubridate, but I just want to try with sqldf for the aforementioned reason. Even though I am quite familiar with SQL syntax, it still gets me most of the time, no exception with sqldf.

Running command like following returned a data.frame with 0 row:

first_date <- "2001-11-1"
second_date <- "2003-11-1"
query <- "select * from file WHERE strftime('%d/%m/%Y', Date, 'unixepoch', 'localtime') between
                    '$first_date' AND '$second_date'"
df <- read.csv.sql(data_file,
                       sql= query,
                       stringsAsFactors=FALSE,
                       sep = ";", header = TRUE)

So for simulation, I tried with sqldf function like the following:

first_date <- "2001-11-1"
second_date <- "2003-11-1"
df2 <- data.frame( Date = paste(rep(1:3, each = 4), 11:12, 2001:2012, sep = "/"))
sqldf("SELECT * FROM df2 WHERE strftime('%d/%m/%Y', Date, 'unixepoch') BETWEEN '$first-date' AND '$second_date' ") 

# Expect:
# Date
# 1  1-11-2001
# 2  1-12-2002
# 3  1-11-2003
like image 958
biocyberman Avatar asked Sep 07 '14 19:09

biocyberman


1 Answers

strftime strftime with percent codes is used to convert an object already regarded by sqlite as a datetime to something else but you want the reverse so the approach in the question is not going to work. For example, here we convert the current time into a dd-mm-yyyy string:

library(sqldf)
sqldf("select strftime('%d-%m-%Y', 'now') now")
##          now
## 1 07-09-2014

Discussion Since SQlite lacks date types its a bit onerous to handle this, particularly with the 1-or-2-digit non-standard date formats, but if you really want to use SQLite we can do it by tediously parsing out the date strings. Using fn$ from the gsubfn package for string interpolation eases this a little.

Code Below zero2d outputs SQL code to prepend a zero character to its input if its one digit. rmSlash outputs SQL code to remove any slashes in its argument. Year, Month and Day each output SQL code to take a character string representing a date in the format under discussion and extract the indicated component reformatting it as a 2 digit zero filled character string in the case of Month and Day. fmtDate takes a character string of the form shown in the question for first_string and second_string and outputs a yyyy-mm-dd character string.

library(sqldf)
library(gsubfn)

zero2d <- function(x) sprintf("substr('0' || %s, -2)", x)

rmSlash <- function(x) sprintf("replace(%s, '/', '')", x)

Year <- function(x) sprintf("substr(%s, -4)", x)

Month <- function(x) {
   y <- sprintf("substr(%s, instr(%s, '/') + 1, 2)", x, x)
   zero2d(rmSlash(y))
}

Day <- function(x) {
   y <- sprintf("substr(%s, 1, 2)", x)
   zero2d(rmSlash(y))
}

fmtDate <- function(x) format(as.Date(x))

sql <- "select * from df2 where 
  `Year('Date')` || '-' || 
  `Month('Date')` || '-' || 
  `Day('Date')`
  between '`fmtDate(first_date)`' and '`fmtDate(second_date)`'"
fn$sqldf(sql)

giving:

       Date
1 1/11/2001
2 1/12/2002
3 1/11/2003

Notes

1) SQLite functions used instr, replace and substr are core sqlite functions

2) SQL The actual SQL statement that is executed after fn$ performs the substitutions is as follows (slightly reformatted to fit):

> cat( fn$identity(sql), "\n")
select * from df2 where 
  substr(Date, -4) 
  || '-' || 
  substr('0' || replace(substr(Date, instr(Date, '/') + 1, 2), '/', ''), -2) 
  || '-' || 
  substr('0' || replace(substr(Date, 1, 2), '/', ''), -2)
  between '2001-11-01' and '2003-11-01' 

3) source of complications the main complication is the non-standard 1-or-2 digit day and month. Had they been consistently 2 digits it would have reduced to this:

first_date <- "2001-11-01"
second_date <- ""2003-11-01"

fn$sqldf("select Date from df2 
   where substr(Date, -4) || '-' || 
         substr(Date, 4, 2) || '-' ||
         substr(Date, 1, 2)
   between '`first_date`' and '`second_date`' ")

4) H2 Here is an H2 solution. H2 does have a datetime type simplifying the solution substantially over SQLite. We assume that the data is in a file called mydata.dat. Note that read.csv.sql does not support H2 as H2 already has the internal csvread SQL function to do that:

library(RH2)
library(sqldf)

first_date <- "2001-11-01"
second_date <- "2003-11-01"

fn$sqldf(c("CREATE TABLE t(DATE TIMESTAMP) AS
  SELECT parsedatetime(DATE, 'd/M/y') as DATE
  FROM CSVREAD('mydata.dat')", 
  "SELECT DATE FROM t WHERE DATE between '`first_date`' and '`second_date`'"))

Note that the first RH2 query will be slow in a session since it loads java. After that you can try it out to see if the performance is adequate.

like image 142
G. Grothendieck Avatar answered Sep 30 '22 10:09

G. Grothendieck