I have a data.table with fields {id, menuitem, amount}.
This is transaction data - so, ids are unique, but menuitem repeats. Now, I want to remove all entries where menuitem == 'coffee'
.
Also, want to delete all rows where amount <= 0
;
What is the right way to do this in data.table?
I can use data$menuitem!='coffee'
and then index int into data[] - but that is not necessarily efficient and does not take advantage of data.table.
Any pointers in the right direction are appreciated.
In this scenario it is not so different than data.frame
data <- data[ menuitem != 'coffee' | amount > 0]
Delete/add row by reference it is to be implemented. You find more info in this question
Regarding speed:
1 You can benefit from keys by doing something like:
setkey(data, menuitem) data <- data[!"coffee"]
which will be faster than data <- data[ menuitem != 'coffee']
. However to apply the same filters you asked in the question you'll need a rolling join (I've finished my lunch break I can add something later :-)).
2 Even without key data.table is much faster for relatively big table (similar speed for handful amount of rows)
dt<-data.table(id=sample(letters,1000000,T),var=rnorm(1000000)) df<-data.frame(id=sample(letters,1000000,T),var=rnorm(1000000)) library(microbenchmark) > microbenchmark(dt[ id == "a"], df[ df$id == "a",]) Unit: milliseconds expr min lq median uq max neval dt[id == "a"] 24.42193 25.74296 26.00996 26.35778 27.36355 100 df[df$id == "a", ] 138.17500 146.46729 147.38646 149.06766 154.10051 100
try this:
data <- data[ !(menuitem == 'coffee' | amount <= 0),]
Generally:
dt <- data.table(a=c(1,1,1,2,2,2,3,3,3),b=c(4,2,3,1,5,3,4,7,6)) dt #> a b #> 1: 1 4 #> 2: 1 2 #> 3: 1 3 #> 4: 2 1 #> 5: 2 5 #> 6: 2 3 #> 7: 3 4 #> 8: 3 7 #> 9: 3 6 dt[a!=1,] #> a b #> 1: 2 1 #> 2: 2 5 #> 3: 2 3 #> 4: 3 4 #> 5: 3 7 #> 6: 3 6
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With