
Selecting random rows by category from a data frame?

Tags:

dataframe

r

I have a data frame as follows:

Category Name Value

How would I select, say, 5 random names per category? Calling sample on the whole data frame draws random rows with every row as a possible candidate. However, I want to specify the number of random rows drawn per category. Any suggestions?

Update: I am open to using ddply

Legend asked Apr 23 '26 05:04



2 Answers

Best guess in absence of test cases:

  # dfrm and cat are placeholder names for the data frame and its category column
  do.call( rbind, lapply( split(dfrm, dfrm$cat) ,
                         function(df) df[sample(nrow(df), 5) , ] )
          )

Tested with Jonathan's data:

> do.call( rbind, lapply( split(df, df$Category) ,
+                          function(df) df[sample(nrow(df), 5) , ] )
+           )

      Category Name      Value   
1.8          1    8 -0.2496109   #  useful side-effect of labeling source group
1.15         1   15 -0.4037368
1.17         1   17 -0.4223724
1.12         1   12 -0.9359026
1.18         1   18  0.3741184
2.37         2   37  0.3033610
2.34         2   34 -0.4517738
2.36         2   36 -0.7695923
snipped remainder
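
One caveat worth sketching: `sample(nrow(df), 5)` raises an error when a category has fewer than 5 rows. A minimal variant, assuming that taking every row of a small category is acceptable, caps the draw at the group size. The helper name `sample_per_group` and the `demo` data are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper (not from the original answer): draw up to n rows per group.
sample_per_group <- function(data, group_col, n) {
  groups <- split(data, data[[group_col]])
  picked <- lapply(groups, function(g) {
    # min() caps the draw at the group size so small groups do not error
    g[sample(seq_len(nrow(g)), min(n, nrow(g))), , drop = FALSE]
  })
  do.call(rbind, picked)
}

set.seed(1)  # only to make the illustration reproducible
# Made-up data: category "a" has 20 rows, category "b" has only 3
demo <- data.frame(Category = rep(c("a", "b"), times = c(20, 3)),
                   Name = 1:23, Value = rnorm(23))
res <- sample_per_group(demo, "Category", 5)
table(res$Category)
```

Here category "b" contributes all 3 of its rows, while "a" contributes 5.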
IRTFM answered Apr 24 '26 18:04



If you want the same number of items from each category, this is easy:

df[unlist(tapply(1:nrow(df),df$Category,function(x) sample(x,3))),]

e.g., I generated df as follows:

df <- data.frame(Category=rep(1:5,each=20),Name=1:100,Value=rnorm(100))

then I get the following from my code:

> df[unlist(tapply(1:nrow(df),df$Category,function(x) sample(x,3))),]
    Category Name       Value
5          1    5  0.25151044
20         1   20  1.52486482
18         1   18  0.69313462
30         2   30  0.73444185
27         2   27  0.24000427
39         2   39 -0.10108203
46         3   46 -0.37200574
49         3   49 -1.84920469
43         3   43  0.35976388
68         4   68  0.57879516
76         4   76 -0.11049302
64         4   64 -0.13471303
100        5  100  0.95979408
95         5   95 -0.01928741
99         5   99  0.85725242

If you want different numbers of rows from each category it will be more complicated.
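
As a rough sketch of that more complicated case, one option is to keep the desired counts in a named vector and pair it with the row indices of each category via `Map`. The vector `n_per_cat` and its values are assumptions made up for this example. Indexing with `sample.int(length(idx), n)` also sidesteps the quirk that `sample(x, n)` treats a single number `x` as `1:x`, which could matter if a category had only one row.

```r
# Same example data as above
df <- data.frame(Category = rep(1:5, each = 20), Name = 1:100, Value = rnorm(100))

# Hypothetical lookup: rows wanted per category (names must match the category values)
n_per_cat <- c("1" = 2, "2" = 5, "3" = 1, "4" = 0, "5" = 3)

# Row indices of df, grouped by category
idx_by_cat <- split(seq_len(nrow(df)), df$Category)

# For each category, pick n positions within its index vector (without replacement)
picked <- Map(function(idx, n) idx[sample.int(length(idx), n)],
              idx_by_cat, n_per_cat[names(idx_by_cat)])

res <- df[unlist(picked), ]
table(res$Category)
```

Each requested count must not exceed the number of rows in its category, otherwise `sample.int` will stop with an error.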

Jonathan Christensen answered Apr 24 '26 19:04
