I have a data frame, df2
, containing observations grouped by a ID factor
that I would like to subset. I have used another function to identify which rows within each factor group that I want to select. This is shown below in df
:
df <- data.frame(ID = c("A","B","C"),
pos = c(1,3,2))
df2 <- data.frame(ID = c(rep("A",5), rep("B",5), rep("C",5)),
obs = c(1:15))
In df
, pos
corresponds to the index of the row that I want to select within the factor level mentioned in ID
, not in the whole dataframe df2
.I'm looking for a way to select the rows for each ID
according to the right index (so their row number within the level of each factor of df2
).
So, in this example, I want to select the first value in df2
with ID == 'A'
, the third value in df2
with ID == 'B'
and the second value in df2
with ID == 'C'
.
This would then give me:
df3 <- data.frame(ID = c("A", "B", "C"),
obs = c(1, 8, 12))
dplyr
library(dplyr)
merge(df,df2) %>%
group_by(ID) %>%
filter(row_number() == pos) %>%
select(-pos)
# ID obs
# 1 A 1
# 2 B 8
# 3 C 12
base R
df2m <- merge(df,df2)
do.call(rbind,
by(df2m, df2m$ID, function(SD) SD[SD$pos[1], setdiff(names(SD),"pos")])
)
by
splits the merged data frame df2m
by df2m$ID
and operates on each part; it returns results in a list, so they must be rbind
ed together at the end. Each subset of the data (associated with each value of ID
) is filtered by pos
and deselects the "pos"
column using normal data.frame syntax.
data.table suggested by @DavidArenburg in a comment
library(data.table)
setkey(setDT(df2),"ID")[df][,
.SD[pos[1L], !"pos", with=FALSE]
, by = ID]
The first part -- setkey(setDT(df2),"ID")[df]
-- is the merge. After that, the resulting table is split by = ID
, and each Subset of Data, .SD
is operated on. pos[1L]
is subsetting in the normal way, while !"pos", with=FALSE
corresponds to dropping the pos
column.
See @eddi's answer for a better data.table approach.
Here's the base R solution:
df2$pos <- ave(df2$obs, df2$ID, FUN=seq_along)
merge(df, df2)
ID pos obs
1 A 1 1
2 B 3 8
3 C 2 12
If df2
is sorted by ID
, you can just do df2$pos <- sequence(table(df2$ID))
for the first line.
Using data.table
version 1.9.5+:
setDT(df2)[df, .SD[pos], by = .EACHI, on = 'ID']
which merges on ID
column, then selects the pos
row for each of the rows of df
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With