
Select only first or last value for each unique value in data table?

Tags: r, data.table

I have a data table like this.

> dt
    ID value
 1   a  v1
 2   a  v2
 3   a  v3
 4   a  v4
 5   a  v5
 6   b  v6
 7   b  v7
 8   b  v8

I want to select only one value for each ID. It can be either the first or the last value. This is how I do it:

unique_id_value_mapping <- dt[, list(new_value=head(.SD[,value],1)), by="ID"]

But for large data tables (~0.1 million rows) this takes a long time. Does anyone know a faster way to do it?

UPDATE
The suggested answer works fine for the problem above. But what if I need to pick the value based on some condition? Consider this data table (the rows I want are marked with *):

> dt
    ID value days
 1   a  v1     2
 2   a  v2     4
 3   a  v3     7 *
 4   a  v4     7
 5   a  v5     1
 6   b  v6     5 *
 7   b  v7     4
 8   b  v8     2

I want to select only one value for each ID, taken from the row where days is at its maximum for that ID. This is how I do it:

unique_id_value_mapping <- dt[, list(new_value=head(.SD[days==max(days),value],1)), by="ID"]

How can I do this faster?

shubham asked Nov 30 '25 02:11


1 Answer

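Try indexing the column directly instead of going through .SD. value[1L] gives the first value in each group and value[.N] the last (.N is the number of rows in the group):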

dt[, list(new_value=value[1L]), ID]
dt[, list(new_value= value[.N]), ID]
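
To try the two lines above on the question's data, here is a minimal sketch. The way dt is constructed is an assumption reconstructed from the printed table in the question; the names first_val and last_val are only for illustration.

```r
library(data.table)

# Rebuild the example table from the question
# (construction assumed from the printed output)
dt <- data.table(ID    = c(rep("a", 5), rep("b", 3)),
                 value = paste0("v", 1:8))

# First value per ID: plain vector indexing, no .SD subsetting
first_val <- dt[, list(new_value = value[1L]), by = ID]

# Last value per ID: .N is the row count of the current group
last_val <- dt[, list(new_value = value[.N]), by = ID]

print(first_val)
print(last_val)
```

On this table, first_val should hold v1 and v6, and last_val should hold v5 and v8.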

Timings on a bigger dataset (1e6 rows, 100 IDs):

library(data.table)

set.seed(24)
df1 <- data.frame(ID = sample(1:100, 1e6, replace = TRUE),
                  value = rnorm(1e6))
dt1 <- as.data.table(df1)
# first value per ID
system.time(dt1[, list(new_value = value[1L]), ID])
#   user  system elapsed 
#  0.012   0.000   0.013 
# last value per ID
system.time(dt1[, list(new_value = value[.N]), ID])
#   user  system elapsed 
#  0.011   0.000   0.012 
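
As a possible alternative (not part of the original answer), data.table's unique() method with the by argument keeps the first row per ID, and fromLast = TRUE keeps the last. The sketch below uses a smaller simulated table (1e5 rows, an assumption chosen to match the size mentioned in the question) and cross-checks the result against the grouped indexing approach. Note that unique() returns whole rows and orders them by where they occur in the table, so the comparison sorts by ID first.

```r
library(data.table)

set.seed(24)
# Simulated data; size and column names are assumptions for illustration
dt1 <- data.table(ID    = sample(1:100, 1e5, replace = TRUE),
                  value = rnorm(1e5))

# First and last row per ID via unique()
first_rows <- unique(dt1, by = "ID")
last_rows  <- unique(dt1, by = "ID", fromLast = TRUE)

# Cross-check against the grouped indexing approach
chk_first <- dt1[, list(value = value[1L]), by = ID]
chk_last  <- dt1[, list(value = value[.N]), by = ID]

same_first <- isTRUE(all.equal(first_rows[order(ID), value],
                               chk_first[order(ID), value]))
same_last  <- isTRUE(all.equal(last_rows[order(ID), value],
                               chk_last[order(ID), value]))
print(c(same_first, same_last))
```

Whether this is faster than value[1L] on a given table is worth timing with system.time() rather than assuming.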

Update

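For the updated question, as @David Arenburg suggested, use which.max. It returns the index of the first maximum, so when several rows tie on days (rows 3 and 4 for ID a) the first one is picked: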

 dt[, list(new_value = value[which.max(days)]), by = ID]
 #    ID new_value
 #1:  a        v3
 #2:  b        v6

If you need the whole row that meets the condition, collect the row indices with .I and subset once:

 dt[dt[, .I[which.max(days)], by = ID]$V1]
 #   ID value days
 #1:  a    v3    7
 #2:  b    v6    5

Or, more compact but typically slower when there are many groups:

 dt[, .SD[which.max(days)], by = ID]
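
The three variants above can be checked against each other with a small sketch. The construction of dt is an assumption reconstructed from the printed table in the updated question, and the last_tie line is an illustrative addition (not from the original answer) for the case where the last of the tied rows is wanted instead of the first.

```r
library(data.table)

# Rebuild the updated example table (construction assumed from the printout)
dt <- data.table(ID    = c(rep("a", 5), rep("b", 3)),
                 value = paste0("v", 1:8),
                 days  = c(2, 4, 7, 7, 1, 5, 4, 2))

# Value only: which.max() returns the index of the first maximum
by_value <- dt[, list(new_value = value[which.max(days)]), by = ID]

# Whole row: gather global row numbers with .I, then subset once
by_index <- dt[dt[, .I[which.max(days)], by = ID]$V1]

# Whole row via .SD
by_sd <- dt[, .SD[which.max(days)], by = ID]

# Illustrative variant: last row among ties, found by searching the
# reversed vector and mapping the position back
last_tie <- dt[, list(new_value = value[.N + 1L - which.max(rev(days))]),
               by = ID]

print(by_value)
print(by_index)
print(last_tie)
```

For ID a, where rows 3 and 4 both have days equal to 7, by_value picks v3 and last_tie picks v4; for ID b both pick v6.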
akrun answered Dec 02 '25 16:12