This has got to have a simple answer. I want to subset my data for testing purposes. I have a data frame where I want to keep all columns of information and simply reduce the number of observations per individual. I have a unique identifier and about 50 individuals. I want to select only 2 individuals, and I want to select only 500 data points from each of those 2.
My data frame is called wloc08. There are 50 unique IDs. I am only taking 2 of those individuals, but of those 2 I'd like only 500 data points from each.
subwloc08=subset(wloc08, subset = ID %in% c("F07001","F07005"))
Can I use [ somewhere in this statement? I tried
reduced = subwloc08$ID[1:500,]
but it doesn't work.
If you're only dealing with 2 individuals, you could get away with subsetting each separately and then rbinding the subsets:
wloc08F07001 <- wloc08[which(wloc08$ID == "F07001")[1:500], ]
wloc08F07005 <- wloc08[which(wloc08$ID == "F07005")[1:500], ]
reduced <- rbind(wloc08F07001, wloc08F07005)
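One thing to watch with which(...)[1:500]: if an individual has fewer than 500 rows, the index vector is padded with NA and the result picks up rows made entirely of NA. The sketch below shows this on a made-up data frame and shows how head() avoids it. The name toy, its x column, and the limit of 3 (standing in for 500) are illustrative assumptions, not the real wloc08.

```r
# Toy stand-in for wloc08; the IDs follow the question, the values are made up.
toy <- data.frame(
  ID = c(rep("F07001", 5), rep("F07005", 2), rep("F07009", 4)),
  x  = 1:11,
  stringsAsFactors = FALSE
)
n <- 3  # stands in for 500

# F07005 has only 2 rows, so indexing which() up to n yields an NA index,
# and that NA index becomes an all-NA row in the result.
padded <- toy[which(toy$ID == "F07005")[1:n], ]

# head() returns at most n rows, so short groups are not padded.
a <- head(toy[toy$ID == "F07001", ], n)
b <- head(toy[toy$ID == "F07005", ], n)
reduced_toy <- rbind(a, b)
```

If every individual is known to have at least 500 rows, the two approaches should give the same result.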
To make this more generalizable, especially if you are dealing with large amounts of data, you might consider looking at the data.table package. Here is an example:
library(data.table)
wloc08DT <- as.data.table(wloc08)  # Create data.table
setkey(wloc08DT, "ID")             # Set a key to subset on
# EDIT: A comment from Matthew Dowle pointed out that by = "ID" isn't necessary
# reduced <- wloc08DT[c("F07001", "F07005"), .SD[1:500], by = "ID"]
reduced <- wloc08DT[c("F07001", "F07005"), .SD[1:500]]
To break down the syntax of the last step:
1. c("F07001", "F07005"): This will subset your data by finding all rows where the key is equal to F07001 or F07005. It will also instigate "by without by" (see ?data.table for details).
2. .SD[1:500]: This will subset the .SD object (the subsetted data.table) by selecting rows 1:500.
3. EDIT: This part was removed thanks to a correction by Matthew Dowle. The "by without by" is initiated by step 1. Formerly: (by = "ID": This tells [.data.table to perform the operation in step 2 for each ID individually, in this case only the IDs that you indicated in step 1.)
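The "by without by" behaviour described above depends on the data.table version. As far as I know, from version 1.9.4 onward a join in i no longer groups implicitly, and by = .EACHI has to be requested to get that behaviour. A sketch that sidesteps the question by filtering first and then grouping explicitly is below. The names toyDT and n and the toy values are made up for illustration.

```r
library(data.table)

# Toy stand-in for wloc08DT; the values are made up.
toyDT <- data.table(
  ID = c(rep("F07001", 5), rep("F07005", 2), rep("F07009", 4)),
  x  = 1:11
)
n <- 3  # stands in for 500

# Keep only the wanted IDs, then take up to n rows within each ID.
# head() is used instead of .SD[1:n] so that short groups are not padded with NA.
reduced_toy <- toyDT[ID %in% c("F07001", "F07005"), head(.SD, n), by = ID]
```

Because the grouping is spelled out with by = ID, this form should not depend on whether a key has been set.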
You could use lapply:
do.call("rbind",
        lapply(c("F07001", "F07005"),
               function(x) wloc08[which(wloc08$ID == x)[1:500], ]))
Your command reduced = subwloc08$ID[1:500,] didn't work because subwloc08$ID is a vector, which has only one dimension, so the row-and-column form [1:500,] is an error. reduced = subwloc08$ID[1:500] would have run, but it would have returned only the first 500 values of subwloc08$ID (not the whole rows of subwloc08).
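The difference between indexing a column and indexing the data frame can be checked on a small made-up example (toy and its columns are illustrative only):

```r
toy <- data.frame(ID = c("a", "a", "b"), x = c(10, 20, 30))

# A column pulled out with $ is a plain vector, so [rows, cols] indexing fails.
# tryCatch() captures the error message instead of stopping the script.
err <- tryCatch(toy$ID[1:2, ], error = function(e) conditionMessage(e))

ids  <- toy$ID[1:2]  # first two values of the ID column only
rows <- toy[1:2, ]   # first two whole rows, all columns kept
```

Here err holds the error text, ids is a length-2 vector, and rows is a 2-row data frame with both columns.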
If you want to run this command for the first 30 subjects, you could use unique(wloc08$ID)[1:30] instead of c("F07001", "F07005"):
do.call("rbind",
        lapply(unique(wloc08$ID)[1:30],
               function(x) wloc08[which(wloc08$ID == x)[1:500], ]))
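If this is something you do often, the same idea could be wrapped in a small helper. The sketch below is one possible shape and is not taken from the answers above: first_n_per_id is a hypothetical name, the toy data is made up, and it uses split() with head() rather than which().

```r
# Hypothetical helper: keep at most n rows for each of the requested IDs.
first_n_per_id <- function(df, ids, n, id_col = "ID") {
  keep   <- df[df[[id_col]] %in% ids, , drop = FALSE]
  # Fixing the factor levels keeps the output in the order of `ids`.
  pieces <- split(keep, factor(keep[[id_col]], levels = ids))
  out    <- do.call(rbind, lapply(pieces, head, n))
  rownames(out) <- NULL
  out
}

toy <- data.frame(
  ID = c(rep("F07001", 5), rep("F07005", 2), rep("F07009", 4)),
  x  = 1:11,
  stringsAsFactors = FALSE
)

# First two subjects, at most 3 rows each (3 stands in for 500).
res <- first_n_per_id(toy, unique(toy$ID)[1:2], 3)
```

With the real data, the call would look something like first_n_per_id(wloc08, unique(wloc08$ID)[1:30], 500).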