
Subset first 500 rows by group, for a subset of groups

This must have a simple answer. I want to subset my data for testing purposes. I have a data frame in which I want to keep all columns of information but reduce the number of observations per individual. There is a unique identifier and about 50 individuals. I want to select only 2 individuals, and only 500 data points from each of those 2.

My data frame is called wloc08. There are 50 unique IDs. I am only taking 2 of those individuals but of those 2, I'd like only 500 data points from each.

subwloc08=subset(wloc08, subset = ID %in% c("F07001","F07005"))

Can I use [ somewhere in this statement?

 reduced= subwloc08$ID[1:500,]

Doesn't work.

Kerry asked Jan 16 '23 13:01



2 Answers

If you're only dealing with 2 individuals, you could get away with subsetting each separately and then rbinding each subset:

wloc08F07001 <- wloc08[which(wloc08$ID == "F07001")[1:500], ]

wloc08F07005 <- wloc08[which(wloc08$ID == "F07005")[1:500], ]

reduced <- rbind(wloc08F07001, wloc08F07005)
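
One caveat with [1:500]: if an individual has fewer than 500 rows, the missing positions become NA and show up as all-NA rows in the result. A minimal sketch of the difference, using a small made-up data frame (toy and its columns are hypothetical, standing in for wloc08) and 3 rows instead of 500:

```r
# Toy stand-in for wloc08 (hypothetical data): "A" has 5 rows, "B" has 2
toy <- data.frame(ID = c(rep("A", 5), rep("B", 2)), x = 1:7)

# [1:3] pads the index with NA when the group has fewer than 3 rows
padded <- toy[which(toy$ID == "B")[1:3], ]
nrow(padded)   # 3 rows, the last one is all NA

# head() stops at the rows that actually exist
safe <- toy[head(which(toy$ID == "B"), 3), ]
nrow(safe)     # 2 rows
```

If every individual is known to have at least 500 observations, both forms give the same result.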

To make this more generalizable, especially if you are dealing with large amounts of data, you might consider the data.table package. Here is an example:

library(data.table)

wloc08DT<-as.data.table(wloc08)  # Create data.table

setkey(wloc08DT, "ID")           # Set a key to subset on

# EDIT: A comment from Matthew Dowle pointed out that by = "ID" isn't necessary
# reduced <- wloc08DT[c("F07001", "F07005"), .SD[1:500], by = "ID"]
reduced <- wloc08DT[c("F07001", "F07005"), .SD[1:500]]

To break down the syntax of the last step:

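  1. c("F07001", "F07005"): This subsets your data to the rows where the key equals F07001 or F07005. In the data.table version this answer was written for, it also triggers "by without by" (see ?data.table for details); from data.table 1.9.4 onward that per-group evaluation has to be requested explicitly with by = .EACHI.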

  2. .SD[1:500]: This subsets the .SD object (the subset of the data.table belonging to the current group) by selecting rows 1:500.

  3. EDIT This part was removed thanks to a correction by Matthew Dowle. The "by without by" is initiated by step 1. Formerly: (by = "ID": This tells [.data.table to perform the operation in step 2 for each ID individually, in this case only the IDs that you indicated in step 1.)
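
If you would rather not depend on implicit grouping by the key, the same idea can be written with an explicit filter and an explicit by. This is only a sketch on made-up data (toyDT, its columns, and n are hypothetical stand-ins for wloc08DT and 500), and it uses head() so that IDs with fewer than n rows are not padded with NA:

```r
library(data.table)

# Toy stand-in for wloc08 (hypothetical data): 4, 2 and 3 rows per ID
toyDT <- data.table(
  ID = rep(c("F07001", "F07005", "F07009"), times = c(4, 2, 3)),
  x  = 1:9
)

n <- 3  # would be 500 for the real data

# i: keep only the wanted IDs; j: first n rows of each group; by: group on ID
reduced <- toyDT[ID %in% c("F07001", "F07005"), head(.SD, n), by = ID]
reduced
```

No key is needed for this form, since the rows are selected with %in% rather than by a keyed lookup.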

BenBarnes answered Jan 26 '23 04:01



You could use lapply:

do.call("rbind",
        lapply(c("F07001", "F07005"),
               function(x) wloc08[which(wloc08$ID == x)[1:500], ]))

Your command reduced = subwloc08$ID[1:500,] didn't work because subwloc08$ID is a vector, which has no second dimension to index after the comma. reduced = subwloc08$ID[1:500] would have run, but it would have returned only the first 500 values of subwloc08$ID, not the whole rows of subwloc08.
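
To see the difference between indexing a column and indexing the data frame, here is a small illustration on made-up data (toy is a hypothetical stand-in for subwloc08):

```r
# Toy stand-in for subwloc08 (hypothetical data)
toy <- data.frame(ID = rep(c("A", "B"), each = 3), x = 1:6)

is.null(dim(toy$ID))      # TRUE: a single column is a vector without dimensions

ids  <- toy$ID[1:2]       # one index: the first two ID values only
rows <- toy[1:2, ]        # row index, comma, empty column index: whole rows

# toy$ID[1:2, ]           # error: incorrect number of dimensions
```

Note that toy[1:2, ] takes the first rows of the whole data frame, not the first rows per ID, which is why the per-ID selection above goes through which().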

If you want to run this command for the first 30 subjects, you could use unique(wloc08$ID)[1:30] instead of c("F07001", "F07005"):

do.call("rbind",
        lapply(unique(wloc08$ID)[1:30],
               function(x) wloc08[which(wloc08$ID == x)[1:500], ]))
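
As an alternative to building one piece per ID and binding them, you could number the rows within each ID and filter once. This is a sketch on made-up data, not a drop-in replacement (toy, ids, n and pos are hypothetical names; n would be 500 and the index 1:30 for the real data):

```r
# Toy stand-in for wloc08 (hypothetical data): 4, 2 and 3 rows per ID
toy <- data.frame(ID = rep(c("A", "B", "C"), times = c(4, 2, 3)), x = 1:9)

ids <- unique(toy$ID)[1:2]   # first two IDs, like unique(wloc08$ID)[1:30]
n   <- 3

# Running row number within each ID: 1, 2, 3, ... restarting for every ID
pos <- ave(seq_along(toy$ID), toy$ID, FUN = seq_along)

# Keep rows that belong to a wanted ID and are among its first n rows
reduced <- toy[toy$ID %in% ids & pos <= n, ]
```

This keeps the rows in their original order and never produces NA rows when an ID has fewer than n observations.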

Sven Hohenstein answered Jan 26 '23 04:01
