I am trying to find a way to determine when a set of columns changes value in a data.frame. Let me get straight to the point, please consider the following example:
x<-data.frame(cnt=1:10, code=rep('ELEMENT 1',10), val0=rep(5,10), val1=rep(6,10),val2=rep(3,10))
x[4,]$val0=6
The data.frame above should be read as: The scores for 'ELEMENT 1' started as 5,6,3, remained as is until the 4 iteration when they changed to 6,6,3, and then changed back to 5,6,3.
My question, is there a way to get the 1st, 4th, and 5th row of the data.frame? Is there a way to detect when the columns change? (There are 12 columns btw)
I tried using the duplicated of data.table (which worked perfectly in the majority of the cases) but in this case it will remove all duplicates and leave rows 1 and 4 only (removing the 5th).
Do you have any suggestions? I would rather not use a for loop as there are approx. 2M lines.
In data.table
version 1.8.10 (stable version in CRAN), there's a(n) (unexported) function called duplist
that does exactly this. And it's also written in C and is therefore terribly fast.
require(data.table) # 1.8.10
data.table:::duplist(x[, 3:5])
# [1] 1 4 5
If you're using the development version of data.table
(1.8.11), then there's a more efficient version (in terms of memory) renamed as uniqlist
, that does exactly the same job. Probably this should be exported for next release. Seems to have come up on SO more than once. Let's see.
require(data.table) # 1.8.11
data.table:::uniqlist(x[, 3:5])
# [1] 1 4 5
Totally unreadable, but:
c(1,which(rowSums(sapply(x[,grep('val',names(x))],diff))!=0)+1)
# [1] 1 4 5
Basically, run diff
on each row, to find all the changes. If a change occurs in any column, then a change has occurred in the row.
Also, without the sapply
:
c(1,which(rowSums(diff(as.matrix(x[,grep('val',names(x))])))!=0)+1)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With