Is there a good way of getting a sample of rows from part of a dataframe?
If I just have data such as
gender <- c("F", "M", "M", "F", "F", "M", "F", "F")
age <- c(23, 25, 27, 29, 31, 33, 35, 37)
then I can easily sample the ages of three of the Fs with
sample(age[gender == "F"], 3)
and get something like
[1] 31 35 29
but if I turn this data into a dataframe
mydf <- data.frame(gender, age)
I cannot use the obvious
sample(mydf[mydf$gender == "F", ], 3)
though I can concoct something convoluted with an absurd number of brackets like
mydf[sample((1:nrow(mydf))[mydf$gender == "F"], 3), ]
and get what I want which is something like
  gender age
7      F  35
4      F  29
1      F  23
Is there a better way that takes me less time to work out how to write?
Subsetting in R is a useful indexing feature for accessing object elements. It can be used to select and filter variables and observations. You can use brackets to select rows and columns from your dataframe.
The sample_n() function from the dplyr package draws a random sample of rows from a data frame.
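As a rough sketch of how that might look on the data from the question (this assumes the dplyr package is installed, hence the guard; the names females and picked are only illustrative):

```r
# Data from the question
gender <- c("F", "M", "M", "F", "F", "M", "F", "F")
age <- c(23, 25, 27, 29, 31, 33, 35, 37)
mydf <- data.frame(gender, age)

# Assumption: dplyr is available; skip quietly if it is not
if (requireNamespace("dplyr", quietly = TRUE)) {
  # Keep only the rows for "F", then draw 3 of them at random
  females <- dplyr::filter(mydf, gender == "F")
  picked <- dplyr::sample_n(females, 3)
  print(picked)
}
```

Recent dplyr releases document slice_sample(n = 3) as the successor to sample_n(), so that may be the better choice in new code.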
Your convoluted way is pretty much how to do it - I think all the answers will be variations on that theme.
For example, I like to generate the indices where mydf$gender == "F" first:
idx <- which(mydf$gender=="F")
Then I sample from that:
mydf[ sample(idx,3), ]
So in one line (although splitting it across several lines reduces the number of nested brackets and possibly makes your code easier to understand):
mydf[ sample( which(mydf$gender=='F'), 3 ), ]
While the "wheee I'm a hacker!" part of me prefers the one-liner, the sensible part of me says that even though the two-liner is two lines, it is much more understandable - it's just your choice.
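One caveat with the index approach: when sample() is given a single number x (at least 1), it samples from 1:x rather than returning x, so sample(idx, 1) would misbehave if only one row matched. A minimal sketch that sidesteps this by sampling positions within idx (the seed and the name picked are my own additions, purely for illustration):

```r
# Data from the question
gender <- c("F", "M", "M", "F", "F", "M", "F", "F")
age <- c(23, 25, 27, 29, 31, 33, 35, 37)
mydf <- data.frame(gender, age)

set.seed(42)  # assumption: fixed seed only so the example is repeatable

# Row numbers of the matching rows
idx <- which(mydf$gender == "F")

# Sample positions within idx, then map back to row numbers;
# this stays correct even if idx has length 1
picked <- mydf[idx[sample.int(length(idx), 3)], ]
print(picked)
```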
You say I cannot use the obvious:
sample(mydf[mydf$gender == "F", ], 3)
but you could write your own function for doing it:
sample.df <- function(df, n) df[sample(nrow(df), n), , drop = FALSE]
then run it on your subset selection:
sample.df(mydf[mydf$gender == "F", ], 3)
#   gender age
# 5      F  31
# 4      F  29
# 1      F  23
(Personally I find sample.df(subset(mydf, gender == "F"), 3) easier to read.)
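For what it's worth, the reason the direct call fails is that a data frame is a list of columns, so sample() tries to pick 3 of its 2 columns. A small sketch showing both the failure and the helper side by side (the names err and picked are just for illustration):

```r
# Data from the question
gender <- c("F", "M", "M", "F", "F", "M", "F", "F")
age <- c(23, 25, 27, 29, 31, 33, 35, 37)
mydf <- data.frame(gender, age)

# A data frame's length is its number of columns
length(mydf)

# Capture the error message instead of stopping the script
err <- tryCatch(
  sample(mydf[mydf$gender == "F", ], 3),
  error = function(e) conditionMessage(e)
)
print(err)

# The helper from above samples row numbers instead of columns
sample.df <- function(df, n) df[sample(nrow(df), n), , drop = FALSE]
picked <- sample.df(subset(mydf, gender == "F"), 3)
print(picked)
```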
This is now simpler with the enhanced version of sample in my package:
# Current devtools expects the repository as "user/repo"
library(devtools); install_github('krlmlr/kimisc')
library(kimisc)
sample.rows(subset(mydf, gender == "F"), 3)
See also this related answer for more detail.