Random sample of rows from subset of an R dataframe [duplicate]

Tags: dataframe, r, sample

Is there a good way of getting a sample of rows from part of a dataframe?

If I just have data such as

gender <- c("F", "M", "M", "F", "F", "M", "F", "F")
age    <- c(23, 25, 27, 29, 31, 33, 35, 37)

then I can easily sample the ages of three of the Fs with

sample(age[gender == "F"], 3)

and get something like

[1] 31 35 29

but if I turn this data into a dataframe

mydf <- data.frame(gender, age)

I cannot use the obvious

sample(mydf[mydf$gender == "F", ], 3)

though I can concoct something convoluted with an absurd number of brackets like

mydf[sample((1:nrow(mydf))[mydf$gender == "F"], 3), ]

and get what I want which is something like

  gender age
7      F  35
4      F  29
1      F  23

Is there a better way that takes me less time to work out how to write?

asked Mar 09 '12 02:03

Henry

3 Answers

Your convoluted way is pretty much how to do it. I think all the answers will be variations on that theme.

For example, I like to compute the indices of the rows where mydf$gender == "F" first:

idx <- which(mydf$gender=="F")

Then I sample from that:

mydf[ sample(idx,3), ]

Or in one line (though splitting it over several lines cuts down the absurd number of brackets and arguably makes the code easier to understand):

mydf[ sample( which(mydf$gender=='F'), 3 ), ]

While the "wheee I'm a hacker!" part of me prefers the one-liner, the sensible part of me says that the two-liner, though longer, is much easier to understand. It's your choice.
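One caveat with the index approach: when sample() is handed a single number n, it samples from 1:n, so sample(idx, 1) misbehaves if only one row matches. Below is a minimal sketch of a wrapper that avoids this by sampling positions within idx. The name sample_rows_where is hypothetical (it is not from any package), and set.seed is there only to make the draw repeatable.

```r
# Data from the question
gender <- c("F", "M", "M", "F", "F", "M", "F", "F")
age    <- c(23, 25, 27, 29, 31, 33, 35, 37)
mydf   <- data.frame(gender, age)

# Hypothetical helper: sample n rows of df from those where keep is TRUE
sample_rows_where <- function(df, keep, n) {
  idx <- which(keep)  # row positions where keep is TRUE; NA entries are dropped
  # Sample positions within idx, so a length-one idx is not treated as 1:idx
  df[idx[sample.int(length(idx), n)], , drop = FALSE]
}

set.seed(1)  # assumption: fixed seed only for repeatability
picked <- sample_rows_where(mydf, mydf$gender == "F", 3)

# Works even when exactly one row matches
one <- sample_rows_where(mydf, mydf$age == 37, 1)
```

Using which() rather than the raw logical vector also means that rows with a missing gender are left out instead of turning into NA rows.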

answered Nov 12 '22 15:11

mathematical.coffee

You say I cannot use the obvious:

sample(mydf[mydf$gender == "F", ], 3)

but you could write your own function for doing it:

sample.df <- function(df, n) df[sample(nrow(df), n), , drop = FALSE]

then run it on your subset selection:

sample.df(mydf[mydf$gender == "F", ], 3)
#   gender age
# 5      F  31
# 4      F  29
# 1      F  23

(Personally I find sample.df(subset(mydf, gender == "F"), 3) easier to read.)
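As for why the obvious call fails: a data frame is a list of columns, so sample() sees two elements (gender and age) and tries to pick three of them. The sketch below illustrates this on the question's data and checks that sample.df returns a data frame even for a single-column input. The variable names females, res and one_col are just for illustration.

```r
# Data from the question
gender <- c("F", "M", "M", "F", "F", "M", "F", "F")
age    <- c(23, 25, 27, 29, 31, 33, 35, 37)
mydf   <- data.frame(gender, age)

females <- mydf[mydf$gender == "F", ]

length(females)  # 2: length() counts columns
nrow(females)    # 5: the number sample() would need to work on rows

# sample() tries to draw 3 of the 2 columns without replacement and errors;
# capture the message instead of stopping the script
res <- tryCatch(sample(females, 3), error = function(e) conditionMessage(e))

# The function from this answer: sample row numbers, then index by row
sample.df <- function(df, n) df[sample(nrow(df), n), , drop = FALSE]

# drop = FALSE keeps the result a data frame even with one column
one_col <- sample.df(females["age"], 2)
```

Without drop = FALSE, the last call would collapse to a plain numeric vector.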

answered Nov 12 '22 17:11

flodel

This is now simpler with the enhanced version of sample in my package:

library(devtools); install_github('krlmlr/kimisc')

library(kimisc)
sample.rows(subset(mydf, gender == "F"), 3)

See also this related answer for more detail.

answered Nov 12 '22 17:11

krlmlr
