For a data.table (or data.frame) in R, I wish to find all rows which contain a value in column 'value' which are a given distance 'distance' from another that value in row with the same key. So, given the following:
distance <- 22
key value
A 1
B 1
C 1
D 1
A 4
B 4
A 23
B 23
B 26
B 26
C 30
I would like to annotated the original table with a count of how many rows exist with the same key, and a value that is +22 from it:
key value count
A 1 1
B 1 1
C 1 0
D 1 0
A 4 0
B 4 2
A 23 0
B 23 0
B 26 0
B 26 0
C 30 0
I don't really know where to begin with this self-referential approach to manipulating data in R. My initial attempts involved creating a second table and trying to match against that, but that seemed a strange and poor approach.
Note: I'm using the data.table
package but I'm happy to work from data.frame in this case if that makes things easier.
Reproducible:
require(data.table)
source <- data.table(data.frame(key=c("A","B","C","D","A","B","A","B","B","B", "C"),value=c(1,1,1,1,4,4,23,23,26,26,30)))
result <- data.table(data.frame(key=c("A","B","C","D","A","B","A","B","B","B","C"),value=c(1,1,1,1,4,4,23,23,26,26,30),count=c(1,1,0,0,0,2,0,0,0,0,0)))
Here's a data.table
based solution. I'll be interested to learn what (if any) improvements can be made to it.
# Your code
library(data.table)
source <-
data.table(data.frame(key = c("A","B","C","D","A","B","A","B","B","B", "C"),
value = c(1,1,1,1,4,4,23,23,26,26,30)))
That strange data.table(data.frame(...
is because data.table()
has an argument called key
, too. That's one way to create a data.table
with a column called "key"
. Capitalising to avoid the argument name conflict illustrates the more standard syntax :
source <- data.table(Key = c("A","B","C","D","A","B","A","B","B","B","C"),
Value = c(1,1,1,1,4,4,23,23,26,26,30))
Next to avoid needing as.integer()
later, we'll change the type of the Value
column from numeric
to integer
now. Remember than 1
is numeric
in R, it is 1L
that is integer
. It is usually better for efficiency to store integer
data as integer
, than integer
as numeric
. The next line is easier than typing lots of L
s above.
source[,Value:=as.integer(Value)] # change type from `numeric` to `integer`
Now proceed
distance <- 22L
setkey(source, Key, Value)
# Heart of the solution (following a few explanatory comments):
# "J()" : shorthand for 'data.table()'
# ".N" : returns the number of rows that matched a line (see ?data.table)
# "[[3]]" : as with simple data.frames, extracts the vector in column 3
source[,count:=source[J(Key,Value+distance),.N][[3]]]
source
key value count
[1,] A 1 1
[2,] A 4 0
[3,] A 23 0
[4,] B 1 1
[5,] B 4 2
[6,] B 23 0
[7,] B 26 0
[8,] B 26 0
[9,] C 1 0
[10,] C 30 0
[11,] D 1 0
Note that :=
changed source
by reference directly, so that's it. But setkey()
also changed the order of the original data. If retaining the original order is required, then:
source <- data.table(Key = c("A","B","C","D","A","B","A","B","B","B","C"),
Value = c(1,1,1,1,4,4,23,23,26,26,30))
source[,Value:=as.integer(Value)]
source[,count:=setkey(copy(source))[source[,list(Key,Value+distance)],.N][[3]]]
Key Value count
[1,] A 1 1
[2,] B 1 1
[3,] C 1 0
[4,] D 1 0
[5,] A 4 0
[6,] B 4 2
[7,] A 23 0
[8,] B 23 0
[9,] B 26 0
[10,] B 26 0
[11,] C 30 0
You could use mapply
to loop through all combinations of key and value:
data.table(t(mapply(function(key,val)
c(key=key,value=val,count=length(source$value[source$key==key & source$value>(val+distance)]) )
, as.character(source$key),source$value)))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With