Subset first 500 rows by group, for a subset of groups

Question

This has got to be a simple answer. I want to subset my data for testing purposes. I have a data frame where I want to keep all columns of information and simply reduce the number of observations PER individual. I have a unique identifier and about 50 individuals. I want to select only 2 individuals, and from each of those 2 I want only 500 data points.

My data frame is called wloc08. There are 50 unique IDs. I am only taking 2 of those individuals but of those 2, I'd like only 500 data points from each.

subwloc08=subset(wloc08, subset = ID %in% c("F07001","F07005"))

Can I use [ somewhere in this statement?

 reduced= subwloc08$ID[1:500,]

Doesn't work.

BenBarnes · Accepted Answer

If you're only dealing with 2 individuals, you could get away with subsetting each separately and then rbinding each subset:

wloc08F07001 <- wloc08[which(wloc08$ID == "F07001")[1:500], ]

wloc08F07005 <- wloc08[which(wloc08$ID == "F07005")[1:500], ]

reduced <- rbind(wloc08F07001, wloc08F07005)
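One caveat with [1:500]: if an individual has fewer than 500 rows, the index vector is padded with NA and the result picks up all-NA rows. A minimal sketch of a variant that uses head() on the row indices instead, so short groups simply stop at their last row. The toy data frame and the helper name first_n are made up for illustration, and n is 3 rather than 500 to keep the example small:

```r
# Toy stand-in for wloc08 (hypothetical data, not the asker's)
wloc08 <- data.frame(ID = rep(c("F07001", "F07005", "F07009"), times = c(5, 2, 4)),
                     x  = 1:11,
                     stringsAsFactors = FALSE)

n <- 3  # would be 500 for the real data

# Hypothetical helper: head() keeps at most n row indices,
# so an individual with fewer than n rows yields no NA rows
first_n <- function(df, id, n) df[head(which(df$ID == id), n), ]

reduced <- rbind(first_n(wloc08, "F07001", n),
                 first_n(wloc08, "F07005", n))

table(reduced$ID)
```

Here F07005 only has 2 rows, and both are kept without any padding.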

To make this more generalizable, especially if you are dealing with large amounts of data, you might consider looking at the data.table package. Here is an example:

library(data.table)

wloc08DT<-as.data.table(wloc08)  # Create data.table

setkey(wloc08DT, "ID")           # Set a key to subset on

# EDIT: A comment from Matthew Dowle pointed out that by = "ID" isn't necessary
# reduced <- wloc08DT[c("F07001", "F07005"), .SD[1:500], by = "ID"]
reduced <- wloc08DT[c("F07001", "F07005"), .SD[1:500]]
# Note: since data.table 1.9.4 this grouping is no longer implicit;
# request it with by = .EACHI (or use the by = "ID" form above):
# reduced <- wloc08DT[c("F07001", "F07005"), .SD[1:500], by = .EACHI]

To break down the syntax of the last step:

1. c("F07001", "F07005"): This subsets your data by finding all rows where the key is equal to F07001 or F07005. In the data.table version this answer was written for, it also triggers "by without by" (see ?data.table for details); from version 1.9.4 on, that per-group evaluation has to be requested with by = .EACHI.
2. .SD[1:500]: This subsets the .SD object (the subsetted data.table) by selecting rows 1:500.
3. EDIT: This part was removed thanks to a correction by Matthew Dowle. The "by without by" is initiated by step 1. Formerly: (by = "ID": This tells [.data.table to perform the operation in step 2 for each ID individually, in this case only the IDs that you indicated in step 1.)
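As a hedged illustration of the same idea with the grouping spelled out, the sketch below filters on ID in i and groups with an explicit by, which does not depend on a key or on version-specific implicit grouping. It assumes a reasonably recent data.table, swaps in head(.SD, n) for .SD[1:n] so that groups with fewer than n rows are not padded with NA rows, and uses made-up toy data with n = 3 instead of 500:

```r
library(data.table)

# Toy stand-in for wloc08DT (hypothetical data)
wloc08DT <- data.table(ID = rep(c("F07001", "F07005", "F07009"), times = c(5, 2, 4)),
                       x  = 1:11)

n   <- 3                       # would be 500 for the real data
ids <- c("F07001", "F07005")   # individuals to keep

# Keep only the chosen IDs, then take up to n rows within each ID
reduced <- wloc08DT[ID %in% ids, head(.SD, n), by = ID]

# Row count per individual in the result
reduced[, .N, by = ID]
```

Changing ids (for example to unique(wloc08DT$ID)[1:30]) is all that is needed to cover more individuals.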

Sven Hohenstein · Answer

You could use lapply:

do.call("rbind",
        lapply(c("F07001", "F07005"),
               function(x) wloc08[which(wloc08$ID == x)[1:500], ]))

Your command reduced = subwloc08$ID[1:500,] didn't work since subwloc08$ID is a vector. However, reduced = subwloc08$ID[1:500] would have worked but would have returned the first 500 values of subwloc08$ID (not the whole rows of subwloc08).
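The difference between indexing a column vector and indexing the data frame can be seen on a small example. This is only a sketch on made-up data; the object names ids_only, err and rows are illustrative:

```r
# Toy data frame (hypothetical), standing in for subwloc08
subwloc08 <- data.frame(ID = c("F07001", "F07001", "F07005"),
                        x  = c(10, 20, 30),
                        stringsAsFactors = FALSE)

# A column extracted with $ is a plain vector: one index, no comma
ids_only <- subwloc08$ID[1:2]

# Two-dimensional indexing on that vector fails; capture the error message
err <- tryCatch(subwloc08$ID[1:2, ], error = function(e) conditionMessage(e))

# Indexing the data frame itself (note the comma) keeps whole rows
rows <- subwloc08[1:2, ]
```

ids_only holds just the first two IDs, err holds the error message raised by the two-dimensional index, and rows is a two-row data frame with every column intact.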

If you want to run this command for the first 30 subjects, you could use unique(wloc08$ID)[1:30] instead of c("F07001", "F07005"):

do.call("rbind",
        lapply(unique(wloc08$ID)[1:30],
               function(x) wloc08[which(wloc08$ID == x)[1:500], ]))
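An alternative sketch that avoids the loop altogether: number the rows within each ID with ave() and keep those whose running number is small enough. This is a different technique from the lapply approach above, shown on made-up data with 3 rows and 2 individuals standing in for 500 and 30. It uses head() on the unique IDs so that asking for more individuals than exist does not introduce NA:

```r
# Toy stand-in for wloc08 (hypothetical data)
wloc08 <- data.frame(ID = rep(c("F07001", "F07005", "F07009"), times = c(5, 2, 4)),
                     x  = 1:11,
                     stringsAsFactors = FALSE)

n_rows <- 3   # would be 500 for the real data
n_ids  <- 2   # would be 30 for the real data

# First n_ids distinct IDs, in order of appearance
keep_ids <- head(unique(wloc08$ID), n_ids)

# Running row number within each ID (1, 2, 3, ... restarting per individual)
row_in_group <- ave(seq_along(wloc08$ID), wloc08$ID, FUN = seq_along)

# Keep rows of the chosen individuals whose within-ID position is <= n_rows
reduced <- wloc08[wloc08$ID %in% keep_ids & row_in_group <= n_rows, ]
```

Because the filter is a logical vector, individuals with fewer than n_rows observations contribute only the rows they have, and the original row order is preserved.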

Tags:

r

data.table

subset

Kerry

2 Answers

BenBarnes

Sven Hohenstein
