R data.table subsetting within a group and splitting a data table into two

Tags: r, data.table

I have the following data.table.

ts,id
1,a
2,a
3,a
4,a
5,a
6,a
7,a
1,b
2,b
3,b
4,b

I want to split this data.table into two. The criterion is to put approximately the first half of each group (here, grouped by the column "id") in one data.table and the remaining rows in another. So the expected result is two data.tables, as follows:

ts,id
1,a
2,a
3,a
4,a
1,b
2,b

and

ts,id
5,a
6,a
7,a
3,b
4,b

I tried the following:

z1 = x[,.SD[.I < .N/2,],by=id]
z1

and got only this:

id ts
a  1
a  2
a  3

Somehow, .I within the .SD isn't working the way I think it should. Any help appreciated. Thanks in advance.

broccoli asked Feb 15 '23 13:02


1 Answer

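.I gives the row locations with respect to the whole data.table, not the group. For group a, .I is 1:7 and .N/2 is 3.5, so only rows 1 to 3 pass the test; for group b, .I is 8:11 while .N/2 is 2, so .I < .N/2 is never TRUE. Thus it can't be used like that within .SD.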
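To make the difference visible, here is a small sketch that rebuilds the question's data (under the name DT, as an assumption) and prints both counters side by side. The column names row_in_table, row_in_group and half_n are only illustrative.

```r
library(data.table)

# Sample data from the question: 7 rows for id "a", 4 rows for id "b"
DT <- data.table(ts = c(1:7, 1:4), id = rep(c("a", "b"), c(7, 4)))

# .I is the row number in the whole table;
# seq_len(.N) restarts at 1 within every group
pos <- DT[, .(ts,
              row_in_table = .I,
              row_in_group = seq_len(.N),
              half_n       = .N / 2),
          by = id]
print(pos)
```

For id "b", row_in_table runs from 8 to 11, so it never falls below half_n, whereas row_in_group runs from 1 to 4.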

Instead, use seq_len(.N) to get each row's position within its group. Something like

DT[, subset := seq_len(.N) > .N/2,by='id']

subset1 <- DT[(subset)][,subset:=NULL]
subset2 <- DT[!(subset)][,subset:=NULL]

subset1
#    ts id
# 1:  4  a
# 2:  5  a
# 3:  6  a
# 4:  7  a
# 5:  3  b
# 6:  4  b
subset2
#   ts id
# 1:  1  a
# 2:  2  a
# 3:  3  a
# 4:  1  b
# 5:  2  b

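This should work. Note that subset1 holds the second half of each group and subset2 the first half; when a group has an odd number of rows, the middle row goes to subset1.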
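If the first table should get the extra row for odd-sized groups, as in the expected output in the question, one possible variant is to collect the row numbers with .I and ceiling(.N/2), without adding a helper column. This is only a sketch; the names first_rows, first_half and second_half are made up for illustration.

```r
library(data.table)

# Sample data from the question
DT <- data.table(ts = c(1:7, 1:4), id = rep(c("a", "b"), c(7, 4)))

# Whole-table row numbers of the first ceiling(.N / 2) rows of every group
first_rows <- DT[, .I[seq_len(.N) <= ceiling(.N / 2)], by = id]$V1

first_half  <- DT[first_rows]   # ts 1-4 for "a", ts 1-2 for "b"
second_half <- DT[-first_rows]  # ts 5-7 for "a", ts 3-4 for "b"
```

Here .I is used only as a lookup of whole-table row numbers, while the comparison itself is done on the within-group position.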

To split each group into more than two parts, you could use cut to create an index with the appropriate number of levels (with labels = FALSE it returns integer codes rather than a factor)

Something like

DT[, subset := cut(seq_len(.N), 3, labels = FALSE), by = 'id']
# you could copy a subset for each level to the global environment,
# but this will not be memory efficient!
list2env(setattr(split(DT, DT[['subset']]), 'names', paste0('s', 1:3)), .GlobalEnv)
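As an alternative to writing into the global environment, the pieces can simply be kept in a named list. The sketch below assumes a data.table version whose split method accepts the by argument, and uses a hypothetical helper column called part instead of subset.

```r
library(data.table)

# Sample data from the question
DT <- data.table(ts = c(1:7, 1:4), id = rep(c("a", "b"), c(7, 4)))

# Label every row with one of 3 parts, computed within each id
DT[, part := cut(seq_len(.N), 3, labels = FALSE), by = id]

# One data.table per part, kept together in a list
parts <- split(DT, by = "part")
names(parts)
```

Each element can then be accessed as parts[["1"]], parts[["2"]] and so on.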

mnel answered Feb 18 '23 11:02