
How do I group by a variable and list by a random order in data.table?

I have a variable that I want to group by. That is easy. However, I want the resulting table to list its rows in random order. What I actually want to do is a little more complicated, but here is a simplified version.

library(data.table)

mydf = data.table(
   x = rep(1:4, each = 5),
   y = rep(c('A', 'B','c','D', 'E'), times = 4),  # 4 x 5 = 20 values, one per row
   v = rpois(20, 30)
)

mydf[,list(sum(x),sum(v)), by=y]
mydf[,list(sum(x),sum(v)), by=list(y=sample(y))]

#to list all the raw data in order of y
mydf[,list(x,v), by=y]
mydf[,list(x,v), by=list(y=sample(y))]

If you look at the resulting output you will notice that y is indeed in random order, but it has become detached from the data that was in the same rows.

What can I do?

Farrel asked Mar 23 '23 12:03


2 Answers

I would do the operation and then order randomly:

mydf[,list(x,v),by=y][sample(seq_len(nrow(mydf)),replace=FALSE)]

EDIT: Random reordering, after grouping:

mydf[,list(sum(x),sum(v)), by=y][sample(seq_len(length(y)),replace=FALSE)]
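A quick way to convince yourself that shuffling after aggregation keeps each sum attached to its own group. This is only a minimal sketch, assuming the data.table package is installed; the names agg, shuffled, sum_x and sum_v are illustrative and not from the question.

```r
library(data.table)

set.seed(42)  # only so the example is repeatable
mydf <- data.table(
  x = rep(1:4, each = 5),
  y = rep(c("A", "B", "c", "D", "E"), times = 4),
  v = rpois(20, 30)
)

# Aggregate first: one row per group, in order of first appearance.
agg <- mydf[, list(sum_x = sum(x), sum_v = sum(v)), by = y]

# Then shuffle whole rows; in i, .N is the number of rows of agg.
shuffled <- agg[sample(.N)]

# Sorting both tables by y gives the same sums, so nothing got detached.
print(identical(agg[order(y), sum_v], shuffled[order(y), sum_v]))
```

Because whole rows are permuted, y and its sums travel together, unlike by=list(y=sample(y)), which shuffles the grouping labels independently of the rows.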

You can also shuffle the order of the groups before grouping. Since by returns groups in order of first appearance, the shuffled order is preserved:

mydf[order(setNames(sample(unique(y)),unique(y))[y])]
mydf[order(setNames(sample(unique(y)),unique(y))[y]),list(sum(x),sum(v)),by=y]

#perhaps more readable:
mydf[{z <- unique(y); order(setNames(sample(z),z)[y])}]
mydf[{z <- unique(y); order(setNames(sample(z),z)[y])},list(sum(x),sum(v)),by=y]
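The shuffled order survives the grouping because by returns groups in order of first appearance, whereas keyby sorts them. A minimal sketch on a toy table (dt, g, n and total are illustrative names, not from the question):

```r
library(data.table)

dt <- data.table(g = c("b", "a", "b", "c", "a"), n = 1:5)

# by: groups come out in the order they first appear in the rows.
res <- dt[, list(total = sum(n)), by = g]
print(res$g)     # "b" "a" "c"

# keyby: groups come out sorted (and the result is keyed by g).
sorted <- dt[, list(total = sum(n)), keyby = g]
print(sorted$g)  # "a" "b" "c"
```

So any reordering done in i, before the grouping, determines the order of the groups in the result.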

This is more transparent if you add a column first and then order by it (note that := adds new.y to mydf by reference):

mydf[,new.y := setNames(sample(unique(y)),unique(y))[y]][order(new.y)]

Breaking it down:

##a random ordering of the unique values of y
##(set.seed is used here to get consistent results;
## the exact permutation may differ across R versions)
set.seed(1); mydf[,{z <- unique(y);sample(z)}]
# [1] "B" "E" "D" "c" "A"
##naming the shuffled values by the original values of y
##creates a one-to-one map from each value of y to a randomly chosen value of y
set.seed(1); mydf[,{z <- unique(y);setNames(sample(z),z)}]
#  A   B   c   D   E 
#"B" "E" "D" "c" "A" 
##subsetting by y puts y through the map
##in effect, every element of y is replaced by another value of y, picked at random
##notice that the names (top row) are the original y
##the values (bottom row) are the mapped-to values
set.seed(1); mydf[,{z <- unique(y);setNames(sample(z),z)[y]}]
#  A   B   c   D   E   A   B   c   D   E   A   B   c   D   E   A   B   c   D   E 
#"B" "E" "D" "c" "A" "B" "E" "D" "c" "A" "B" "E" "D" "c" "A" "B" "E" "D" "c" "A"
##ordering by this now orders by the mapped-to values
set.seed(1); mydf[{z <- unique(y);order(setNames(sample(z),z)[y])}]
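The same effect can be had with match() instead of a named lookup vector: order the rows by the position of each y in a shuffled vector of the unique values. This swaps in a different technique from the one broken down above, shown only as a sketch; shuffled_levels and res are illustrative names.

```r
library(data.table)

set.seed(1)  # only so the example is repeatable
mydf <- data.table(
  x = rep(1:4, each = 5),
  y = rep(c("A", "B", "c", "D", "E"), times = 4),
  v = rpois(20, 30)
)

# One random ordering of the groups.
shuffled_levels <- sample(unique(mydf$y))

# match() gives each row the position of its group in that ordering;
# ordering by it keeps every group's rows together.
res <- mydf[order(match(y, shuffled_levels))]

# Groups now appear in the shuffled order, four contiguous rows each.
print(identical(unique(res$y), shuffled_levels))
print(rle(res$y)$lengths)
```

Adding j and by=y to the same call, as in the answer's examples, then aggregates with the groups in the shuffled order.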

EDIT: Incorporating Arun's suggestion in the comments to use setattr to set the names:

mydf[{z <- unique(y); order(setattr(sample(z),'names',z)[y])}]
mydf[{z <- unique(y); order(setattr(sample(z),'names',z)[y])},list(sum(x),sum(v)),by=y]
Blue Magister answered Apr 06 '23 01:04


I think this is what you're looking for...?

mydf[,.SD[sample(.N)],by=y]

Inspired by @BlueMagister's second solution, here's the randomize-first way:

mydf[sample(nrow(mydf)),.SD,by=y]

Here, use keyby instead of by if you want the groups to appear in sorted order rather than in order of first appearance.
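To see how the variants differ, here is a small side-by-side sketch. The object names within, first and sorted are illustrative; the seed is only there to make the run repeatable.

```r
library(data.table)

set.seed(7)
mydf <- data.table(
  x = rep(1:4, each = 5),
  y = rep(c("A", "B", "c", "D", "E"), times = 4),
  v = rpois(20, 30)
)

# Shuffle rows inside each group; groups stay in first-appearance order.
within <- mydf[, .SD[sample(.N)], by = y]

# Shuffle all rows first, then gather them by group; the group order
# follows first appearance in the shuffled rows.
first <- mydf[sample(nrow(mydf)), .SD, by = y]

# Same, but keyby sorts the groups. data.table sorts in C-locale,
# so the lowercase "c" lands after the uppercase letters.
sorted <- mydf[sample(nrow(mydf)), .SD, keyby = y]

print(unique(within$y))
print(unique(first$y))
print(unique(sorted$y))
```

In all three results each row keeps its own x and v; only the order of rows and groups changes.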

Frank answered Apr 06 '23 02:04