Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Select row by level of a factor

Tags:

r

r-factor

subset

I have a data frame, df2, containing observations grouped by a ID factor that I would like to subset. I have used another function to identify which rows within each factor group that I want to select. This is shown below in df:

df <- data.frame(ID = c("A","B","C"),
                 pos = c(1,3,2))
df2 <- data.frame(ID = c(rep("A",5), rep("B",5), rep("C",5)),
                  obs = c(1:15))

In df, pos corresponds to the index of the row that I want to select within the factor level mentioned in ID, not in the whole dataframe df2.I'm looking for a way to select the rows for each ID according to the right index (so their row number within the level of each factor of df2).

So, in this example, I want to select the first value in df2 with ID == 'A', the third value in df2 with ID == 'B' and the second value in df2 with ID == 'C'.

This would then give me:

df3 <- data.frame(ID = c("A", "B", "C"),
                  obs = c(1, 8, 12))
like image 521
Chris. Z Avatar asked Aug 25 '15 17:08

Chris. Z


3 Answers

dplyr

library(dplyr)

merge(df,df2) %>% 
  group_by(ID) %>% 
  filter(row_number() == pos) %>%
  select(-pos)

#   ID obs
# 1  A   1
# 2  B   8
# 3  C  12

base R

df2m <- merge(df,df2)
do.call(rbind, 
  by(df2m, df2m$ID, function(SD) SD[SD$pos[1], setdiff(names(SD),"pos")]) 
)

by splits the merged data frame df2m by df2m$ID and operates on each part; it returns results in a list, so they must be rbinded together at the end. Each subset of the data (associated with each value of ID) is filtered by pos and deselects the "pos" column using normal data.frame syntax.

data.table suggested by @DavidArenburg in a comment

library(data.table)

setkey(setDT(df2),"ID")[df][, 
  .SD[pos[1L], !"pos", with=FALSE]
, by = ID]

The first part -- setkey(setDT(df2),"ID")[df] -- is the merge. After that, the resulting table is split by = ID, and each Subset of Data, .SD is operated on. pos[1L] is subsetting in the normal way, while !"pos", with=FALSE corresponds to dropping the pos column.

See @eddi's answer for a better data.table approach.

like image 106
Frank Avatar answered Oct 27 '22 22:10

Frank


Here's the base R solution:

df2$pos <- ave(df2$obs, df2$ID, FUN=seq_along)
merge(df, df2)
  ID pos obs
1  A   1   1
2  B   3   8
3  C   2  12

If df2 is sorted by ID, you can just do df2$pos <- sequence(table(df2$ID)) for the first line.

like image 22
user295691 Avatar answered Oct 27 '22 23:10

user295691


Using data.table version 1.9.5+:

setDT(df2)[df, .SD[pos], by = .EACHI, on = 'ID']

which merges on ID column, then selects the pos row for each of the rows of df.

like image 26
eddi Avatar answered Oct 27 '22 22:10

eddi