I have a data frame, df2, containing observations grouped by a ID factor that I would like to subset. I have used another function to identify which rows within each factor group that I want to select. This is shown below in df:
df <- data.frame(ID = c("A","B","C"),
pos = c(1,3,2))
df2 <- data.frame(ID = c(rep("A",5), rep("B",5), rep("C",5)),
obs = c(1:15))
In df, pos corresponds to the index of the row that I want to select within the factor level mentioned in ID, not in the whole dataframe df2.I'm looking for a way to select the rows for each ID according to the right index (so their row number within the level of each factor of df2).
So, in this example, I want to select the first value in df2 with ID == 'A', the third value in df2 with ID == 'B' and the second value in df2 with ID == 'C'.
This would then give me:
df3 <- data.frame(ID = c("A", "B", "C"),
obs = c(1, 8, 12))
dplyr
library(dplyr)
merge(df,df2) %>%
group_by(ID) %>%
filter(row_number() == pos) %>%
select(-pos)
# ID obs
# 1 A 1
# 2 B 8
# 3 C 12
base R
df2m <- merge(df,df2)
do.call(rbind,
by(df2m, df2m$ID, function(SD) SD[SD$pos[1], setdiff(names(SD),"pos")])
)
by splits the merged data frame df2m by df2m$ID and operates on each part; it returns results in a list, so they must be rbinded together at the end. Each subset of the data (associated with each value of ID) is filtered by pos and deselects the "pos" column using normal data.frame syntax.
data.table suggested by @DavidArenburg in a comment
library(data.table)
setkey(setDT(df2),"ID")[df][,
.SD[pos[1L], !"pos", with=FALSE]
, by = ID]
The first part -- setkey(setDT(df2),"ID")[df] -- is the merge. After that, the resulting table is split by = ID, and each Subset of Data, .SD is operated on. pos[1L] is subsetting in the normal way, while !"pos", with=FALSE corresponds to dropping the pos column.
See @eddi's answer for a better data.table approach.
Here's the base R solution:
df2$pos <- ave(df2$obs, df2$ID, FUN=seq_along)
merge(df, df2)
ID pos obs
1 A 1 1
2 B 3 8
3 C 2 12
If df2 is sorted by ID, you can just do df2$pos <- sequence(table(df2$ID)) for the first line.
Using data.table version 1.9.5+:
setDT(df2)[df, .SD[pos], by = .EACHI, on = 'ID']
which merges on ID column, then selects the pos row for each of the rows of df.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With