I have a data table like this.
> dt
ID value
1 a v1
2 a v2
3 a v3
4 a v4
5 a v5
6 b v6
7 b v7
8 b v8
I want to select only one value for each ID; it can be either the first or the last value. This is how I do it:
unique_id_value_mapping <- dt[, list(new_value=head(.SD[,value],1)), by="ID"]
But for large data tables (~0.1 million rows) this takes a lot of time. Does anyone know a faster way to do it?
UPDATE
The answer suggested for the above problem works fine.
But what if I need to pick the value based on some condition? Consider this data table (the rows I want are marked with *):
> dt
ID value days
1 a v1 2
2 a v2 4
3 a v3 7 *
4 a v4 7
5 a v5 1
6 b v6 5 *
7 b v7 4
8 b v8 2
I want to select only one value for each ID, taken from the row where days is at its maximum for that ID. This is how I do it:
unique_id_value_mapping <- dt[, list(new_value=head(.SD[days==max(days),value],1)), by="ID"]
How can I do this faster?
Try
dt[, list(new_value=value[1L]), ID]
dt[, list(new_value= value[.N]), ID]
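For anyone who wants to run this, here is a minimal sketch on the sample data from the question. The construction of `dt` is my own reconstruction of the printed table, so treat it as an assumption. Indexing the column directly (`value[1L]`, `value[.N]`) most likely helps because it avoids subsetting `.SD` as a data.table once per group.

```r
library(data.table)

# Reconstruction of the sample table from the question (assumed layout)
dt <- data.table(ID = c(rep("a", 5), rep("b", 3)),
                 value = paste0("v", 1:8))

# First value per ID: plain vector indexing inside j
first_vals <- dt[, list(new_value = value[1L]), by = ID]

# Last value per ID: .N is the number of rows in the current group
last_vals <- dt[, list(new_value = value[.N]), by = ID]

print(first_vals)
print(last_vals)
```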
Using a bigger dataset
library(data.table)
set.seed(24)
df1 <- data.frame(ID= sample(1:100, 1e6, replace=TRUE),
value=rnorm(1e6))
dt1 <- as.data.table(df1)
system.time(dt1[, list(new_value=value[1L]), ID])
# user system elapsed
# 0.012 0.000 0.013
system.time(dt1[, list(new_value=value[.N]), ID])
# user system elapsed
# 0.011 0.000 0.012
Based on the new update, as @David Arenburg suggested
dt[, list(new_value = value[which.max(days)]), by = ID]
# ID new_value
#1: a v3
#2: b v6
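Note that ID a has a tie in the sample data (v3 and v4 both have days = 7). `which.max()` returns the position of the first maximum only, which is why v3 is picked. A small sketch of the difference, again with `dt` rebuilt by hand from the question (an assumption on my part); the second call is just one possible way to keep all tied rows if that is what you need:

```r
library(data.table)

# Reconstruction of the updated sample table from the question (assumed layout)
dt <- data.table(ID = c(rep("a", 5), rep("b", 3)),
                 value = paste0("v", 1:8),
                 days = c(2, 4, 7, 7, 1, 5, 4, 2))

# which.max() keeps only the first row that reaches the maximum
one_per_id <- dt[, list(new_value = value[which.max(days)]), by = ID]
print(one_per_id)

# Comparing against max(days) keeps every tied row instead
all_ties <- dt[, list(new_value = value[days == max(days)]), by = ID]
print(all_ties)
```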
If you need the whole row that meets the condition:
dt[dt[, .I[which.max(days)], by = ID]$V1]
# ID value days
#1: a v3 7
#2: b v6 5
Or
dt[, .SD[which.max(days)], by = ID]
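Both forms should select the same rows. The sketch below checks that on the sample data (with `dt` rebuilt by hand from the question, which is an assumption). On large tables the `.I` version is usually the quicker of the two, since `.SD[...]` is evaluated as a data.table subset for every group, but it is worth timing both on your own data.

```r
library(data.table)

# Reconstruction of the updated sample table from the question (assumed layout)
dt <- data.table(ID = c(rep("a", 5), rep("b", 3)),
                 value = paste0("v", 1:8),
                 days = c(2, 4, 7, 7, 1, 5, 4, 2))

# .I holds row numbers in the original table; collect one per group, then subset once
rows_by_I <- dt[dt[, .I[which.max(days)], by = ID]$V1]

# .SD is the per-group subset; pick the matching row within each group
rows_by_SD <- dt[, .SD[which.max(days)], by = ID]

print(rows_by_I)
print(rows_by_SD)

# Same rows selected by both approaches
same_rows <- identical(rows_by_I$value, rows_by_SD$value)
print(same_rows)
```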