Summarizing by groups applying function which involves the next group

Question

Let's assume I have the following data:

set.seed(1)
test <- data.frame(letters=rep(c("A","B","C","D"),10), numbers=sample(1:50, 40, replace=TRUE))

I want to know how many numbers whose letter is A are not in B, how many numbers of B are not in C and so on.

I came up with a solution for this using base functions split and mapply:

s.test <-split(test, test$letters)
notIn <- mapply(function(x,y) sum(!s.test[[x]]$numbers %in% s.test[[y]]$numbers), x=names(s.test)[1:3], y=names(s.test)[2:4])

Which gives:

> notIn
A B C 
9 7 7

But I would also like to do this with dplyr or data.table. Is it possible?

Arun · Accepted Answer

The bottleneck seems to be in split. When simulated on 200 groups and 150,000 observations each, split takes 50 seconds out of the total 54 seconds.

The split step can be made drastically faster using data.table as follows.

## test is a data.table here
s.test <- test[, list(list(.SD)), by=letters]$V1

Here's a benchmark on data of your dimensions using data.table + mapply:

## generate data
set.seed(1L)
k = 200L
n = 150000L
test <- data.frame(letters=sample(paste0("id", 1:k), n*k, TRUE), 
                 numbers=sample(1e6, n*k, TRUE), stringsAsFactors=FALSE)

require(data.table)   ## latest CRAN version is v1.9.2
setDT(test)           ## convert to data.table by reference (no copy)
system.time({
    s.test <- test[, list(list(.SD)), by=letters]$V1 ## split
    setattr(s.test, 'names', unique(test$letters))   ## setnames
    notIn <- mapply(function(x,y) 
         sum(!s.test[[x]]$numbers %in% s.test[[y]]$numbers), 
              x=names(s.test)[1:199], y=names(s.test)[2:200])
})

##   user  system elapsed 
##  4.840   1.643   6.624

That's about ~7.5x speedup on your biggest data dimensions. Would this be sufficient?

G. Grothendieck · Answer

This seems to give about the same speedup as with data.table but only uses base R. Instead of splitting the data frame it splits the numbers column only (in line marked ##):

## generate data - from Arun's post
set.seed(1L)
k = 200L
n = 150000L
test <- data.frame(letters=sample(paste0("id", 1:k), n*k, TRUE), 
                 numbers=sample(1e6, n*k, TRUE), stringsAsFactors=FALSE)

system.time({
    s.numbers <- with(test, split(numbers, letters)) ##
    notIn <- mapply(function(x,y) 
         sum(!s.numbers[[x]] %in% s.numbers[[y]]), 
              x=names(s.numbers)[1:199], y=names(s.numbers)[2:200])
})

Summarizing by groups applying function which involves the next group

Tags:

dataframe

r

data.table

dplyr

Carlos Cinelli

2 Answers

Arun

G. Grothendieck

Recent Activity

Donate For Us

Summarizing by groups applying function which involves the next group

Tags:

dataframe

r

data.table

dplyr

Carlos Cinelli

2 Answers

Arun

G. Grothendieck

Related questions

Recent Activity

Donate For Us