
r - apply function to each row of a data.table

Tags:

r

data.table

I'm looking to use data.table to improve speed for a given function, but I'm not sure I'm implementing it the correct way:

Data

Given two data.tables (dt and dt_lookup)

library(data.table)
set.seed(1234)
t <- seq(1,100); l <- letters; la <- letters[1:13]; lb <- letters[14:26]
n <- 10000
dt <- data.table(id=seq_len(n),
                 thisTime=sample(t, n, replace=TRUE), 
                 thisLocation=sample(la,n,replace=TRUE),
                 finalLocation=sample(lb,n,replace=TRUE))
setkey(dt, thisLocation)

set.seed(4321)
dt_lookup <- data.table(lkpId = paste0("l-",seq(1,1000)),
                        lkpTime=sample(t, 10000, replace=TRUE),
                        lkpLocation=sample(l, 10000, replace=TRUE))
## NOTE: lkpId is purposely recycled
setkey(dt_lookup, lkpLocation)

I have a function that finds the lkpId that contains both thisLocation and finalLocation, and has the 'nearest' lkpTime (i.e. the smallest strictly positive value of thisTime - lkpTime; the code filters on temp2 > 0, so a difference of zero is excluded).

Function

## function to get the 'next' lkpId (i.e. the lkpId with both thisLocation and finalLocation,
## with the smallest positive difference between thisTime and dt_lookup$lkpTime)
getId <- function(thisTime, thisLocation, finalLocation){

  ## filter lookup based on thisLocation and finalLocation,
  ## and only return values where the lkpId has both 'this' and 'final' locations
  tempThis <- unique(dt_lookup[lkpLocation == thisLocation,lkpId])
  tempFinal <- unique(dt_lookup[lkpLocation == finalLocation,lkpId])
  availServices <- tempThis[tempThis %in% tempFinal]

  tempThisFinal <- dt_lookup[lkpId %in% availServices & lkpLocation==thisLocation, .(lkpId, lkpTime)]

  ## calculate time difference between 'thisTime' and 'lkpTime' (from thisLocation)
  temp2 <- thisTime - tempThisFinal$lkpTime

  ## take the lkpId with the minimum positive difference (first match wins ties)
  selectedId <- tempThisFinal[min(which(temp2==min(temp2[temp2>0]))),lkpId]
  selectedId
}
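To make the selection rule in the last step of getId concrete, here is a small base R sketch. The values below are made up purely for illustration and are not taken from dt or dt_lookup.

```r
# Hypothetical values, chosen only to illustrate the selection rule
thisTime <- 50
lkpTime  <- c(60, 45, 50, 48, 48)
lkpId    <- c("l-1", "l-2", "l-3", "l-4", "l-5")

timeDiff <- thisTime - lkpTime             # -10, 5, 0, 2, 2
positive <- timeDiff > 0                   # strictly positive, so 0 is excluded
best     <- min(timeDiff[positive])        # smallest positive difference
selected <- lkpId[which(timeDiff == best)[1]]  # first match wins ties
selected
```

Here the difference of 0 (l-3) is skipped and the tie between l-4 and l-5 goes to the first of the two, which matches what min(which(...)) does in the function.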

Attempts at a solution

I need to get the lkpId for each row of dt. My initial instinct was to use an *apply function, but it was taking too long (for me) when nrow(dt) > 1,000,000. So I've tried to implement a data.table solution to see if it's faster:

selectedId <- dt[,.(lkpId = getId(thisTime, thisLocation, finalLocation)),by=id]

However, I'm fairly new to data.table, and this method doesn't appear to give any performance gains over an *apply solution:

lkpIds <- apply(dt, 1, function(x){
  thisLocation <- as.character(x[["thisLocation"]])
  finalLocation <- as.character(x[["finalLocation"]])
  thisTime <- as.numeric(x[["thisTime"]])
  getId(thisTime, thisLocation, finalLocation)
})

Both take about 30 seconds for n = 10,000.
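As a side note on why the apply version needs the as.character / as.numeric calls: apply() works on a matrix, so a table with mixed column types is coerced to character first, whereas grouping with by keeps the column types. A minimal sketch on a hypothetical three-column table (toy is not part of the question's data):

```r
library(data.table)

# Hypothetical toy table with mixed column types
toy <- data.table(id = 1:2, thisTime = c(5, 10), thisLocation = c("a", "b"))

# apply() coerces the table to a character matrix, so each row arrives as character
row_classes <- apply(toy, 1, function(x) class(x))

# grouping by id keeps thisTime numeric inside j
by_row <- toy[, .(cls = class(thisTime)), by = id]

row_classes
by_row
```

Neither form avoids calling the function once per row, which is probably where most of the time goes.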

Question

Is there a better way of using data.table to apply the getId function over each row of dt?

Update 12/08/2015

Thanks to the pointer from @eddi I've redesigned my whole algorithm and am making use of rolling joins (a good introduction), thus making proper use of data.table. I'll write up an answer later.
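For anyone unfamiliar with rolling joins, here is a minimal sketch of the idea on toy data. The tables events and lookup and their columns are hypothetical, and this is not the redesigned algorithm itself. Note that roll = TRUE also accepts an exact time match, whereas getId requires a strictly positive difference.

```r
library(data.table)

# Hypothetical toy tables
events <- data.table(id = 1:3, loc = c("a", "a", "b"), time = c(10, 3, 7))
lookup <- data.table(lkpId = c("l-1", "l-2", "l-3"),
                     loc   = c("a", "a", "b"),
                     time  = c(2, 8, 9))

# keep a copy of the lookup time: after the join, `time` holds the events' value
lookup[, lkpTime := time]

# for each row of events, take the lookup row with the same loc and the
# latest time at or before the event time (last observation carried forward)
res <- lookup[events, on = .(loc, time), roll = TRUE]
res
```

Event 3 has no earlier lookup row at location "b", so its lkpId comes back as NA.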


asked Aug 11 '15 05:08

tospig

1 Answer

Since asking this question I have looked into what data.table has to offer and, thanks to @eddi's pointer, researched data.table joins (for example Rolling join on data.table, and inner join with inequality). I've come up with a solution.

One of the tricky parts was moving away from the thought of 'apply a function to each row', and redesigning the solution to use joins.
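As an illustration of what 'using joins' can look like, the inequality condition can be put directly into on =, assuming a data.table version with non-equi joins (1.9.8 or later, as far as I know). This is only a sketch on hypothetical toy tables (dt_toy and lkp_toy are not the tables from the question), and it covers just the thisLocation step.

```r
library(data.table)

# Hypothetical toy data
dt_toy  <- data.table(id = 1:2, thisTime = c(10, 5), thisLocation = c("a", "b"))
lkp_toy <- data.table(lkpId       = c("l-1", "l-2", "l-3"),
                      lkpTime     = c(4, 12, 5),
                      lkpLocation = c("a", "a", "b"))

# same location, and lookup time strictly earlier than thisTime;
# the x. and i. prefixes pick the original columns from each side
cand <- lkp_toy[dt_toy,
                on = .(lkpLocation == thisLocation, lkpTime < thisTime),
                .(id = i.id, lkpId = x.lkpId,
                  lkpTime = x.lkpTime, thisTime = i.thisTime),
                nomatch = 0L]
cand
```

Only id 1 survives here: l-2 is later than thisTime, and for id 2 the single candidate has a time difference of zero, which the strict inequality excludes.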

There are no doubt better ways of programming this, but here's my attempt.

## want to find a lkpId for each id, that has the minimum difference between 'thisTime' and 'lkpTime'
## and where the lkpId contains both 'thisLocation' and 'finalLocation'

## find all lookup ids where 'thisLocation' matches 'lkpLocation'
## and where thisTime - lkpTime > 0
setkey(dt, thisLocation)
setkey(dt_lookup, lkpLocation)

dt_this <- dt[dt_lookup, {
  idx = thisTime - i.lkpTime > 0
  .(id = id[idx],
    lkpId = i.lkpId,
    thisTime = thisTime[idx],
    lkpTime = i.lkpTime)
},
by=.EACHI]

## remove NAs
dt_this <- dt_this[complete.cases(dt_this)]

## find all rows where 'finalLocation' matches 'lkpLocation'
setkey(dt, finalLocation)
## inner join (and only return the id columns)
dt_final <- dt[dt_lookup, nomatch=0, allow.cartesian=TRUE][,.(id, lkpId)]

## join dt_this to dt_final (as lkpId must have both 'thisLocation' and 'finalLocation')
setkey(dt_this, id, lkpId)
setkey(dt_final, id, lkpId)

dt_join <- dt_this[dt_final, nomatch=0]

## take the combination with the minimum difference between 'thisTime' and 'lkpTime'
dt_join[,timeDiff := thisTime - lkpTime]

dt_join <- dt_join[ dt_join[order(timeDiff), .I[1], by=id]$V1]  

## equivalent dplyr code
# library(dplyr)
# dt_join <- dt_join %>%
#   group_by(id) %>%
#   arrange(timeDiff) %>%
#   slice(1) %>%
#   ungroup
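A variant of the last step, in case it is useful: the row with the smallest timeDiff per id can also be picked with which.min() inside each group, which needs no sorting in i. This is a sketch on a hypothetical toy table, not on dt_join itself.

```r
library(data.table)

# Hypothetical stand-in for dt_join
toy <- data.table(id       = c(1, 1, 2, 2, 2),
                  lkpId    = c("l-1", "l-2", "l-3", "l-4", "l-5"),
                  timeDiff = c(7, 3, 9, 2, 4))

# .I holds the row numbers of toy; which.min() picks the first minimum per id
idx  <- toy[, .I[which.min(timeDiff)], by = id]$V1
best <- toy[idx]
best
```

Ties go to the first row within each id, in the table's current order.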


answered Sep 23 '22 06:09

tospig
