Find rows with a given difference between values in a column

Question

For a data.table (or data.frame) in R, I wish to find all rows which contain a value in column 'value' which are a given distance 'distance' from another that value in row with the same key. So, given the following:

distance <- 22
   key value
   A     1
   B     1
   C     1
   D     1
   A     4
   B     4
   A    23
   B    23
   B    26
   B    26
   C    30

I would like to annotated the original table with a count of how many rows exist with the same key, and a value that is +22 from it:

  key value count
  A     1     1
  B     1     1
  C     1     0
  D     1     0
  A     4     0
  B     4     2
  A    23     0
  B    23     0
  B    26     0
  B    26     0
  C    30     0

I don't really know where to begin with this self-referential approach to manipulating data in R. My initial attempts involved creating a second table and trying to match against that, but that seemed a strange and poor approach.

Note: I'm using the data.table package but I'm happy to work from data.frame in this case if that makes things easier.

Reproducible:

require(data.table)
source <- data.table(data.frame(key=c("A","B","C","D","A","B","A","B","B","B", "C"),value=c(1,1,1,1,4,4,23,23,26,26,30)))
result <- data.table(data.frame(key=c("A","B","C","D","A","B","A","B","B","B","C"),value=c(1,1,1,1,4,4,23,23,26,26,30),count=c(1,1,0,0,0,2,0,0,0,0,0)))

Josh O'Brien · Accepted Answer

Here's a data.table based solution. I'll be interested to learn what (if any) improvements can be made to it.

# Your code
library(data.table)
source <- 
data.table(data.frame(key = c("A","B","C","D","A","B","A","B","B","B", "C"),
                      value = c(1,1,1,1,4,4,23,23,26,26,30)))

That strange data.table(data.frame(... is because data.table() has an argument called key, too. That's one way to create a data.table with a column called "key". Capitalising to avoid the argument name conflict illustrates the more standard syntax :

source <- data.table(Key = c("A","B","C","D","A","B","A","B","B","B","C"),
                     Value = c(1,1,1,1,4,4,23,23,26,26,30))

Next to avoid needing as.integer() later, we'll change the type of the Value column from numeric to integer now. Remember than 1 is numeric in R, it is 1L that is integer. It is usually better for efficiency to store integer data as integer, than integer as numeric. The next line is easier than typing lots of Ls above.

source[,Value:=as.integer(Value)]   # change type from `numeric` to `integer`

Now proceed

distance <- 22L
setkey(source, Key, Value)

# Heart of the solution (following a few explanatory comments):
#  "J()"   : shorthand for 'data.table()'
#  ".N"    : returns the number of rows that matched a line (see ?data.table)
#  "[[3]]" : as with simple data.frames, extracts the vector in column 3

source[,count:=source[J(Key,Value+distance),.N][[3]]]
source
      key value count
 [1,]   A     1     1
 [2,]   A     4     0
 [3,]   A    23     0
 [4,]   B     1     1
 [5,]   B     4     2
 [6,]   B    23     0
 [7,]   B    26     0
 [8,]   B    26     0
 [9,]   C     1     0
[10,]   C    30     0
[11,]   D     1     0

Note that := changed source by reference directly, so that's it. But setkey() also changed the order of the original data. If retaining the original order is required, then:

source <- data.table(Key = c("A","B","C","D","A","B","A","B","B","B","C"),
                     Value = c(1,1,1,1,4,4,23,23,26,26,30))
source[,Value:=as.integer(Value)]   
source[,count:=setkey(copy(source))[source[,list(Key,Value+distance)],.N][[3]]]

      Key Value count
 [1,]   A     1     1
 [2,]   B     1     1
 [3,]   C     1     0
 [4,]   D     1     0
 [5,]   A     4     0
 [6,]   B     4     2
 [7,]   A    23     0
 [8,]   B    23     0
 [9,]   B    26     0
[10,]   B    26     0
[11,]   C    30     0

nograpes · Answer

You could use mapply to loop through all combinations of key and value:

data.table(t(mapply(function(key,val) 
      c(key=key,value=val,count=length(source$value[source$key==key & source$value>(val+distance)]) )
   , as.character(source$key),source$value)))

Find rows with a given difference between values in a column

Tags:

dataframe

r

data.table

Ina

2 Answers

Josh O'Brien

nograpes

Recent Activity

Donate For Us

Find rows with a given difference between values in a column

Tags:

dataframe

r

data.table

Ina

2 Answers

Josh O'Brien

nograpes

Related questions

Recent Activity

Donate For Us