I have a dataset with thousands of lines and the following columns: ID, parentID, rank, and scientificName.
I wish to create a new column giving the family (a level in rank) that a given species belongs to. If anyone could help, it would be greatly appreciated.
Example data:
ID = c('f1','f2','g1','g2','g3','g4','s1','s2','s3','s4','s5','s6') # all unique
parentID = c(NA,NA,'f1','f1','f2','f2','g1','g1','g2','g3','g3','g4')
rank = c('family','family','genus','genus','genus','genus','species','species','species','species','species','species')
scientificName = c('FamA','FamB','GenA','GenB','GenC','GenD','SpA','SpB','SpC','SpD','SpE','SpF')
dat = data.frame( ID, parentID, rank, scientificName)
My desired output (in this example) would be an extra column giving the family of each row: family = c('FamA','FamB','FamA','FamA','FamB','FamB','FamA','FamA','FamA','FamB','FamB','FamB')
I've thought about creating vectors of families and their IDs, then replacing the codes in the parentID column with family names, and then trying something similar for the genus to ultimately 'link' family info with each species, but it got kinda messy in the end (that is, it didn't work). I think what I need can be accomplished with the 'dplyr' package, but I'm stuck... Again, I'd appreciate any help.
This is a good problem for recursion. Here's a vectorized base R solution.
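find_family <- function(ID, parentID, scientificName) {
  find_family_id <- function(ID, parentID) {
    # Step up one level: replace each ID with its parent, keeping IDs that have no parent
    ID_new <- ifelse(!is.na(parentID), parentID, ID)
    # Look up the parents of the new IDs
    parentID_new <- parentID[match(ID_new, ID)]
    # Stop once every ID is a root (its parent is NA); here the roots are the families
    if (all(is.na(parentID_new))) return(ID_new)
    find_family_id(ID_new, parentID_new)
  }
  family_ids <- find_family_id(ID, parentID)
  # Translate the root IDs into their names
  scientificName[match(family_ids, ID)]
}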
dat$family <- with(dat, find_family(ID, parentID, scientificName))
dat
#    ID parentID    rank scientificName family
# 1  f1     <NA>  family           FamA   FamA
# 2  f2     <NA>  family           FamB   FamB
# 3  g1       f1   genus           GenA   FamA
# 4  g2       f1   genus           GenB   FamA
# 5  g3       f2   genus           GenC   FamB
# 6  g4       f2   genus           GenD   FamB
# 7  s1       g1 species            SpA   FamA
# 8  s2       g1 species            SpB   FamA
# 9  s3       g2 species            SpC   FamA
# 10 s4       g3 species            SpD   FamB
# 11 s5       g3 species            SpE   FamB
# 12 s6       g4 species            SpF   FamB
You could write a small recursive function to look up the ID until it gets to the family rank. Then apply this function using some iterator (purrr::map_chr here to ensure a character vector):
library(dplyr)
library(purrr)
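get_family <- function(x) {
  # Row of the current ID; look it up in dat rather than in a global ID vector
  i <- match(x, dat$ID)
  # Return the name at family rank, otherwise climb to the parent
  if (dat$rank[i] == "family") dat$scientificName[i] else get_family(dat$parentID[i])
}
dat |>
  mutate(family = map_chr(ID, get_family))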
Output
   ID parentID    rank scientificName family
1  f1     <NA>  family           FamA   FamA
2  f2     <NA>  family           FamB   FamB
3  g1       f1   genus           GenA   FamA
4  g2       f1   genus           GenB   FamA
5  g3       f2   genus           GenC   FamB
6  g4       f2   genus           GenD   FamB
7  s1       g1 species            SpA   FamA
8  s2       g1 species            SpB   FamA
9  s3       g2 species            SpC   FamA
10 s4       g3 species            SpD   FamB
11 s5       g3 species            SpE   FamB
12 s6       g4 species            SpF   FamB
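If the real data has deeper hierarchies, or rows whose parent is missing, the same upward walk can also be written as a loop instead of recursion. The following is only a minimal sketch under the assumptions of the example data in the question; `climb_to_rank` is a hypothetical helper name, not a function from any package.

```r
# Example data from the question
dat <- data.frame(
  ID = c('f1','f2','g1','g2','g3','g4','s1','s2','s3','s4','s5','s6'),
  parentID = c(NA,NA,'f1','f1','f2','f2','g1','g1','g2','g3','g3','g4'),
  rank = c('family','family','genus','genus','genus','genus',
           'species','species','species','species','species','species'),
  scientificName = c('FamA','FamB','GenA','GenB','GenC','GenD',
                     'SpA','SpB','SpC','SpD','SpE','SpF'),
  stringsAsFactors = FALSE
)

# Hypothetical helper: follow parentID links upward until `target` rank is reached.
# Rows that never reach that rank (missing parent, or a cycle) get NA.
climb_to_rank <- function(dat, target = "family") {
  idx <- seq_len(nrow(dat))             # row each record currently points at
  for (step in seq_len(nrow(dat))) {    # a valid chain is never longer than nrow(dat)
    todo <- !is.na(idx) & dat$rank[idx] != target
    if (!any(todo)) break
    # Move the unfinished records one level up; an unknown parent becomes NA
    idx[todo] <- match(dat$parentID[idx[todo]], dat$ID)
  }
  # Anything still not at the target rank did not resolve
  idx[!is.na(idx) & dat$rank[idx] != target] <- NA
  dat$scientificName[idx]
}

dat$family <- climb_to_rank(dat)
```

All rows are moved up together on each pass, so the number of passes depends on the depth of the hierarchy rather than on the number of rows.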