
Take a random sample based on groups

I have a data frame of almost 50,000 rows spread across 15 different IDs (every ID has thousands of observations). The data frame looks like this:

        ID  Year    Temp    ph
1       P1  1996    11.3    6.80
2       P1  1996    9.7     6.90
3       P1  1997    9.8     7.10
...
2000    P2  1997    10.5    6.90
2001    P2  1997    9.9     7.00
2002    P2  1997    10.0    6.93

I want to take 500 random rows for every ID (500 for P1, 500 for P2, ...) and create a new data frame. I tried:

new_df <- df[df$ID %in% sample(unique(df$ID), 500), ]

But it picks an ID at random, while I need 500 random rows for every ID.

matteo asked Aug 15 '13 17:08



2 Answers

This is available as the slice_sample function in dplyr:

library(dplyr)
new_df <- df %>%
  group_by(ID) %>%
  slice_sample(n = 500)

In versions of dplyr before 1.0.0, the function was called sample_n, which has since been superseded by slice_sample.
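
To try the grouped sampling before running it on the real data, here is a minimal sketch on made-up data. The data frame `toy`, its values, and the sample size of 5 are assumptions for illustration only; `set.seed()` is added so the draw is reproducible, and `ungroup()` is an optional extra step that drops the grouping from the result.

```r
library(dplyr)

set.seed(42)  # fix the RNG so the same rows are drawn on every run

# Hypothetical toy data: 3 IDs with 1,000 rows each
toy <- data.frame(
  ID   = rep(c("P1", "P2", "P3"), each = 1000),
  Temp = round(rnorm(3000, mean = 10, sd = 1), 1)
)

sampled <- toy %>%
  group_by(ID) %>%
  slice_sample(n = 5) %>%  # 5 rows per ID, drawn without replacement
  ungroup()                # return an ungrouped tibble for later steps

# One row per ID with the number of sampled rows
count(sampled, ID)
```

Without `ungroup()`, the result stays grouped by ID, so later dplyr verbs would keep operating per group.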

drhagen answered Sep 23 '22 23:09


Try this:

library(plyr)
ddply(df, .(ID), function(x) x[sample(nrow(x), 500), ])
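
The same split, sample, and recombine idea can also be sketched in base R without extra packages. This is one possible variant, not the answer's own code: the names `toy`, `n_per_group`, `rows_by_id`, and `picked` are hypothetical. It also caps the sample size at the group size, because `sample()` without replacement stops with an error when asked for more rows than a group has (not an issue in the question, where every ID has thousands of rows).

```r
set.seed(1)  # make the draw reproducible

# Hypothetical toy data: P1 has 1,000 rows, P2 has only 3
toy <- data.frame(
  ID = rep(c("P1", "P2"), times = c(1000, 3)),
  ph = round(runif(1003, min = 6.5, max = 7.5), 2)
)

n_per_group <- 5

# Split the row indices by ID, sample within each group, then recombine
rows_by_id <- split(seq_len(nrow(toy)), toy$ID)
picked <- lapply(rows_by_id, function(idx) {
  k <- min(n_per_group, length(idx))  # never ask for more rows than exist
  idx[sample.int(length(idx), k)]     # sample positions, then map to row indices
})
new_toy <- toy[unlist(picked, use.names = FALSE), ]

# Number of sampled rows per ID
table(new_toy$ID)
```

Sampling positions with `sample.int()` and then indexing into `idx` avoids the known quirk where `sample(x, ...)` treats a single number `x` as the range `1:x`.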

joran answered Sep 24 '22 23:09