 

Improve speed of queries using complex survey design in R

I have a large data set (more than 20 million observations) that I analyze with the survey package, and it is taking me ages to run simple queries. I have tried to find ways to speed up my code, but I would like to know if there are better ways to make this more efficient.

In my benchmark, I compare the speed of three commands using svyby/svytotal:

  1. A simple svyby/svytotal call
  2. Parallel computing with foreach %dopar% using 7 cores
  3. A compiled version of option 2

Spoiler: Option 3 is more than twice as fast as option 1, BUT it is not suitable for large data sets because it relies on parallel computing, which quickly hits memory limits when dealing with large data sets. I face this problem even with 16GB of RAM. There are a few solutions to this memory limitation, but none of them are applicable to survey design objects.

Any ideas on how to make it faster and not crash because of memory limits?

My code with a reproducible example:

# Load Packages
library(survey)
library(data.table)
library(compiler)
library(foreach) 
library(doParallel)
options(digits=3)

# Load Data
data(api)

# Convert data to data.table format (mostly to increase speed of the process)
apiclus1 <- as.data.table(apiclus1)

# Multiply the number of observations by 1000 
apiclus1 <- apiclus1[rep(seq_len(nrow(apiclus1)), 1000), ]

# create a count variable
apiclus1[, Vcount := 1]

# create survey design
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)

1) Simple code

t1 <- Sys.time()
table1 <- svyby(~Vcount,
                ~stype+dnum+cname,
                design = dclus1,
                svytotal)
T1 <- Sys.time() - t1

2) Parallel computing with foreach dopar using 7 cores

# in this option, I create a list with different subsets of the survey design
# that will be passed to different CPU cores to work at the same time

subdesign <- function(i){ subset(dclus1, dnum==i)}
groups <- unique(apiclus1$dnum)
list_subsets <- lapply(groups, subdesign) # apply the function and get all subsets in a list
i <- NULL

# Start Parallel
registerDoParallel(cores=7)

t2 <- Sys.time()
table2 <- foreach(i = list_subsets, .combine = rbind, .packages = "survey") %dopar% {
  options( survey.lonely.psu = "remove" )
  svyby(~Vcount,
        ~stype+dnum+cname,
        design = i,
        svytotal)}
T2 <- Sys.time() - t2

3) A compiled version of option 2

# make a function of the previous query
query2 <- function(list_subsets) { foreach(i = list_subsets, .combine = rbind, .packages = "survey") %dopar% {
  svyby(~Vcount,
        ~stype+dnum+cname,
        design = i,
        svytotal)}}

# Compile the function to increase speed
query3 <- cmpfun(query2)

t3 <- Sys.time()
table3 <- query3(list_subsets)
T3 <- Sys.time() - t3

Results

> T1: 1.9 secs
> T2: 1.13 secs
> T3: 0.58 secs

barplot(c(T1, T2, T3),  
        names.arg = c("1) simple table", "2) parallel", "3) compiled parallel"),
        ylab="Seconds")


asked Sep 03 '15 by rafa.pereira


1 Answer

Thanks for laying this question out so well. Working efficiently with large survey datasets in R probably requires some basic SQL syntax (which is much easier to learn than R). MonetDB is the only large-data option compatible with the survey package; exploring other high-performance packages is (probably) not going to be fruitful. Generally, when I am exploring a huge dataset, I write SQL queries directly rather than using the survey package, because the standard error computations are computationally intensive (and variances aren't as useful during interactive data exploration). Notice how the final SQL timestamp blows away all of the other options. To calculate a quick weighted mean, use something like "SELECT by_column , SUM( your_column * the_weight ) / SUM( the_weight ) FROM yourdata GROUP BY by_column" (a sketch of this appears at the end of the answer).

When you do need standard errors interactively, linearization (svydesign) is often more computationally intensive than replication (svrepdesign), but sometimes creating the replication designs by hand (as I have done with jk1w_dclus1 below) requires more survey-methods familiarity than some users are comfortable with (for an in-memory design, as.svrepdesign can automate the conversion, as sketched in the code below).

# Load Packages
library(MonetDB.R)
library(MonetDBLite)
library(DBI)   # suggested in comments and needed on OSX
library(survey)

# Load Data
data(api)

# Multiply the number of observations by 10000 
apiclus1 <- apiclus1[rep(seq_len(nrow(apiclus1)), 10000), ]

# create a count variable
apiclus1$vcount <- 1

# create survey design
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)


dbfolder <- tempdir()

db <- dbConnect( MonetDBLite() , dbfolder )
dbWriteTable( db , 'apiclus1' , apiclus1 )


db_dclus1 <-
    svydesign(
        weight = ~pw ,
        id = ~dnum ,
        data = "apiclus1" , 
        dbtype = "MonetDBLite" ,
        dbname = dbfolder ,
        fpc = ~fpc
    )

# you provided a design without strata,
# so type="JK1" matches that most closely.
# but see survey:::as.svrepdesign for other linearization-to-replication options
jk1w <- jk1weights( psu = apiclus1$dnum , fpc = apiclus1$fpc )

# after the replicate-weights have been constructed,
# here's the `svrepdesign` call:
jk1w_dclus1 <-
    svrepdesign(
        weight = ~pw ,
        type = "JK1" ,
        repweights = jk1w$repweights ,
        combined.weights = FALSE ,
        scale = jk1w$scale ,
        rscales = jk1w$rscales ,
        data = 'apiclus1' ,
        dbtype = "MonetDBLite" ,
        dbname = dbfolder
    )
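
# aside (a hedged sketch, not part of the original answer): for an
# in-memory design such as `dclus1` above, survey's as.svrepdesign()
# builds the same kind of JK1 replicate design in a single step,
# without constructing the replicate weights by hand.
# it may not work directly on the database-backed `db_dclus1`,
# and `jk1w_converted` is just an illustrative name.
jk1w_converted <- as.svrepdesign( dclus1 , type = "JK1" )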

# slow
system.time(res1 <- svyby(~vcount,~stype+dnum+cname,design = dclus1,svytotal))
# > system.time(res1 <- svyby(~vcount,~stype+dnum+cname,design = dclus1,svytotal))
   # user  system elapsed 
  # 17.40    2.86   20.27 


# faster
system.time(res2 <- svyby(~vcount,~stype+dnum+cname,design = db_dclus1,svytotal))
# > system.time(res2 <- svyby(~vcount,~stype+dnum+cname,design = db_dclus1,svytotal))
   # user  system elapsed 
  # 13.00    1.20   14.18 


# fastest
system.time(res3 <- svyby(~vcount,~stype+dnum+cname,design = jk1w_dclus1,svytotal))
# > system.time(res3 <- svyby(~vcount,~stype+dnum+cname,design = jk1w_dclus1,svytotal))
   # user  system elapsed 
  # 10.75    1.19   11.96 

# same standard errors across the board
all.equal( SE( res1 ) , SE( res2 ) )
all.equal( SE( res2 ) , SE( res3 ) )
# NOTE: the replicate-weighted design will be slightly different
# for certain designs.  however this technique is defensible
# and gets used in 
# https://github.com/ajdamico/asdfree/tree/master/Censo%20Demografico


# at the point you do not care about standard errors,
# learn some sql:
system.time( res4 <- dbGetQuery( db , "SELECT stype , dnum , cname , SUM( pw ) FROM apiclus1 GROUP BY stype , dnum , cname" ) )
# > system.time( res4 <- dbGetQuery( db , "SELECT stype , dnum , cname , SUM( pw ) FROM apiclus1 GROUP BY stype , dnum , cname" ) )
   # user  system elapsed 
   # 0.15    0.20    0.23 
# because this is near-instantaneous, no matter how much data you have.

# same numbers as res1:
all.equal( as.numeric( sort( coef( res1 ) ) ) , sort( res4$L1 ) )
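
As a closing sketch of the weighted-mean query mentioned at the top: the snippet below reuses the db connection and apiclus1 table already loaded above; api00, stype, and the result name res5 are illustrative choices rather than anything from the question, and the svyby call is only there to confirm the point estimates agree.

# a hedged example of the weighted-mean SQL pattern from the text,
# run against the existing MonetDBLite connection:
res5 <- dbGetQuery( db ,
    "SELECT stype , SUM( api00 * pw ) / SUM( pw ) AS wmean_api00
     FROM apiclus1 GROUP BY stype" )

# the same point estimates (plus standard errors) from the survey package:
svyby( ~api00 , ~stype , design = db_dclus1 , svymean )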

answered Oct 21 '22 by Anthony Damico