Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find rows with a given difference between values in a column

For a data.table (or data.frame) in R, I wish to find all rows which contain a value in column 'value' which are a given distance 'distance' from another that value in row with the same key. So, given the following:

distance <- 22
   key value
   A     1
   B     1
   C     1
   D     1
   A     4
   B     4
   A    23
   B    23
   B    26
   B    26
   C    30

I would like to annotated the original table with a count of how many rows exist with the same key, and a value that is +22 from it:

  key value count
  A     1     1
  B     1     1
  C     1     0
  D     1     0
  A     4     0
  B     4     2
  A    23     0
  B    23     0
  B    26     0
  B    26     0
  C    30     0

I don't really know where to begin with this self-referential approach to manipulating data in R. My initial attempts involved creating a second table and trying to match against that, but that seemed a strange and poor approach.

Note: I'm using the data.table package but I'm happy to work from data.frame in this case if that makes things easier.

Reproducible:

require(data.table)
source <- data.table(data.frame(key=c("A","B","C","D","A","B","A","B","B","B", "C"),value=c(1,1,1,1,4,4,23,23,26,26,30)))
result <- data.table(data.frame(key=c("A","B","C","D","A","B","A","B","B","B","C"),value=c(1,1,1,1,4,4,23,23,26,26,30),count=c(1,1,0,0,0,2,0,0,0,0,0)))
like image 353
Ina Avatar asked May 23 '12 18:05

Ina


2 Answers

Here's a data.table based solution. I'll be interested to learn what (if any) improvements can be made to it.

# Your code
library(data.table)
source <- 
data.table(data.frame(key = c("A","B","C","D","A","B","A","B","B","B", "C"),
                      value = c(1,1,1,1,4,4,23,23,26,26,30)))

That strange data.table(data.frame(... is because data.table() has an argument called key, too. That's one way to create a data.table with a column called "key". Capitalising to avoid the argument name conflict illustrates the more standard syntax :

source <- data.table(Key = c("A","B","C","D","A","B","A","B","B","B","C"),
                     Value = c(1,1,1,1,4,4,23,23,26,26,30))

Next to avoid needing as.integer() later, we'll change the type of the Value column from numeric to integer now. Remember than 1 is numeric in R, it is 1L that is integer. It is usually better for efficiency to store integer data as integer, than integer as numeric. The next line is easier than typing lots of Ls above.

source[,Value:=as.integer(Value)]   # change type from `numeric` to `integer`

Now proceed

distance <- 22L
setkey(source, Key, Value)

# Heart of the solution (following a few explanatory comments):
#  "J()"   : shorthand for 'data.table()'
#  ".N"    : returns the number of rows that matched a line (see ?data.table)
#  "[[3]]" : as with simple data.frames, extracts the vector in column 3

source[,count:=source[J(Key,Value+distance),.N][[3]]]
source
      key value count
 [1,]   A     1     1
 [2,]   A     4     0
 [3,]   A    23     0
 [4,]   B     1     1
 [5,]   B     4     2
 [6,]   B    23     0
 [7,]   B    26     0
 [8,]   B    26     0
 [9,]   C     1     0
[10,]   C    30     0
[11,]   D     1     0

Note that := changed source by reference directly, so that's it. But setkey() also changed the order of the original data. If retaining the original order is required, then:

source <- data.table(Key = c("A","B","C","D","A","B","A","B","B","B","C"),
                     Value = c(1,1,1,1,4,4,23,23,26,26,30))
source[,Value:=as.integer(Value)]   
source[,count:=setkey(copy(source))[source[,list(Key,Value+distance)],.N][[3]]]

      Key Value count
 [1,]   A     1     1
 [2,]   B     1     1
 [3,]   C     1     0
 [4,]   D     1     0
 [5,]   A     4     0
 [6,]   B     4     2
 [7,]   A    23     0
 [8,]   B    23     0
 [9,]   B    26     0
[10,]   B    26     0
[11,]   C    30     0
like image 122
Josh O'Brien Avatar answered Sep 17 '22 00:09

Josh O'Brien


You could use mapply to loop through all combinations of key and value:

data.table(t(mapply(function(key,val) 
      c(key=key,value=val,count=length(source$value[source$key==key & source$value>(val+distance)]) )
   , as.character(source$key),source$value)))
like image 27
nograpes Avatar answered Sep 19 '22 00:09

nograpes