
Assigning Family to Species Based on Dataset Attributes

Tags: r, dplyr

I have a dataset with thousands of lines and the following columns: ID, parentID, rank, and scientificName.

I want to create a new column giving the family (one of the levels in rank) that each species belongs to. If anyone could help, it would be greatly appreciated.

Example data:

ID = c('f1','f2','g1','g2','g3','g4','s1','s2','s3','s4','s5','s6') # all unique
parentID = c(NA,NA,'f1','f1','f2','f2','g1','g1','g2','g3','g3','g4')
rank = c('family','family','genus','genus','genus','genus','species','species','species','species','species','species')
scientificName = c('FamA','FamB','GenA','GenB','GenC','GenD','SpA','SpB','SpC','SpD','SpE','SpF')
dat = data.frame( ID, parentID, rank, scientificName)

My desired output (in this example) would be an extra column giving the family for each row: family = c('FamA','FamB','FamA','FamA','FamB','FamB','FamA','FamA','FamA','FamB','FamB','FamB')

I've thought about creating vectors of families and their IDs, then replacing the codes in the parentID column with family names, and then doing something similar for the genus to ultimately 'link' family info with each species, but it got messy in the end (that is, it didn't work). I think what I need can be accomplished with the 'dplyr' package, but I'm stuck. Again, I'd appreciate any help.

Jhonny Guedes asked Dec 07 '25 07:12



2 Answers

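This is a good problem for recursion. Here's a vectorized base R solution. It climbs the parentID links for all rows at once until every row has reached a top-level entry (one with no parent), so it assumes families are the top-level rows, as in the example data.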

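find_family <- function(ID, parentID, scientificName) {
  # Climb one level per call until every row sits on a root (parentID is NA).
  find_family_id <- function(ID, parentID) {
    # Move each row up to its parent; rows already at a root stay put.
    ID_new <- ifelse(!is.na(parentID), parentID, ID)
    # Look up the parent of each new position.
    parentID_new <- parentID[match(ID_new, ID)]
    if (all(is.na(parentID_new))) return(ID_new)
    find_family_id(ID_new, parentID_new)
  }
  family_ids <- find_family_id(ID, parentID)
  # Translate the root IDs into their names.
  scientificName[match(family_ids, ID)]
}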

dat$family <- with(dat, find_family(ID, parentID, scientificName))

dat
#    ID parentID    rank scientificName family
# 1  f1     <NA>  family           FamA   FamA
# 2  f2     <NA>  family           FamB   FamB
# 3  g1       f1   genus           GenA   FamA
# 4  g2       f1   genus           GenB   FamA
# 5  g3       f2   genus           GenC   FamB
# 6  g4       f2   genus           GenD   FamB
# 7  s1       g1 species            SpA   FamA
# 8  s2       g1 species            SpB   FamA
# 9  s3       g2 species            SpC   FamA
# 10 s4       g3 species            SpD   FamB
# 11 s5       g3 species            SpE   FamB
# 12 s6       g4 species            SpF   FamB
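
The function above stops at rows with no parent rather than at rows whose rank is "family". If the real data has levels above family (order, class, and so on), a variant that checks rank at each step may be closer to what is needed. The following is only a sketch under that assumption: family_by_rank and max_depth are hypothetical names, not part of the question or of any package, and the example data is repeated so the block runs on its own.

```r
dat <- data.frame(
  ID = c("f1", "f2", "g1", "g2", "g3", "g4", "s1", "s2", "s3", "s4", "s5", "s6"),
  parentID = c(NA, NA, "f1", "f1", "f2", "f2", "g1", "g1", "g2", "g3", "g3", "g4"),
  rank = c(rep("family", 2), rep("genus", 4), rep("species", 6)),
  scientificName = c("FamA", "FamB", "GenA", "GenB", "GenC", "GenD",
                     "SpA", "SpB", "SpC", "SpD", "SpE", "SpF"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: walk up parentID links until a "family" row is reached.
# max_depth is an assumed safety cap so that a cycle in the data cannot loop forever.
family_by_rank <- function(dat, max_depth = 20) {
  idx <- seq_len(nrow(dat))  # row that each record currently points at
  for (step in seq_len(max_depth)) {
    # Rows that still point at a non-family row and can keep climbing.
    climb <- !is.na(idx) & !(dat$rank[idx] %in% "family")
    if (!any(climb)) break
    # Move those rows up one level; a missing parent becomes NA.
    idx[climb] <- match(dat$parentID[idx[climb]], dat$ID)
  }
  out <- dat$scientificName[idx]
  # Anything that never reached a family row is reported as NA.
  out[!(dat$rank[idx] %in% "family")] <- NA
  out
}

dat$family <- family_by_rank(dat)
```

Rows with no family among their ancestors come back as NA instead of being labelled with whatever sits at the top of their chain.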
zephryl answered Dec 08 '25 20:12



You could write a small recursive function that follows parentID from a given ID until it reaches a row whose rank is "family". Then apply this function to every ID using some iterator (purrr::map_chr here to ensure a character vector):

library(dplyr)
library(purrr)

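# Follow parentID upwards until a row of rank "family" is found.
get_family <- function(x) {
  i <- match(x, dat$ID)  # look the ID up in dat instead of relying on a global ID vector
  with(dat, if (rank[i]=="family") scientificName[i] else get_family(parentID[i]))
}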

dat |>
  mutate(family = map_chr(ID, get_family))

Output

   ID parentID    rank scientificName family
1  f1     <NA>  family           FamA   FamA
2  f2     <NA>  family           FamB   FamB
3  g1       f1   genus           GenA   FamA
4  g2       f1   genus           GenB   FamA
5  g3       f2   genus           GenC   FamB
6  g4       f2   genus           GenD   FamB
7  s1       g1 species            SpA   FamA
8  s2       g1 species            SpB   FamA
9  s3       g2 species            SpC   FamA
10 s4       g3 species            SpD   FamB
11 s5       g3 species            SpE   FamB
12 s6       g4 species            SpF   FamB
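
If the hierarchy is known to have exactly the three levels in the example (family, genus, species), the recursion can be skipped and the family found with two match() lookups. This is a minimal base R sketch under that assumption; parent_row and grand_row are illustrative names, and the example data is repeated so the block runs on its own.

```r
dat <- data.frame(
  ID = c("f1", "f2", "g1", "g2", "g3", "g4", "s1", "s2", "s3", "s4", "s5", "s6"),
  parentID = c(NA, NA, "f1", "f1", "f2", "f2", "g1", "g1", "g2", "g3", "g3", "g4"),
  rank = c(rep("family", 2), rep("genus", 4), rep("species", 6)),
  scientificName = c("FamA", "FamB", "GenA", "GenB", "GenC", "GenD",
                     "SpA", "SpB", "SpC", "SpD", "SpE", "SpF"),
  stringsAsFactors = FALSE
)

# Row index of each record's parent and grandparent (NA where there is none).
parent_row <- match(dat$parentID, dat$ID)
grand_row  <- match(dat$parentID[parent_row], dat$ID)

is_family  <- dat$rank == "family"
is_genus   <- dat$rank == "genus"
is_species <- dat$rank == "species"

# Assumes exactly three levels: a family names itself, a genus takes its
# parent's name, and a species takes its grandparent's name.
dat$family <- NA_character_
dat$family[is_family]  <- dat$scientificName[is_family]
dat$family[is_genus]   <- dat$scientificName[parent_row[is_genus]]
dat$family[is_species] <- dat$scientificName[grand_row[is_species]]
```

This avoids one function call per row, which may matter with thousands of rows, but it breaks as soon as the data contains a rank outside these three, in which case the recursive lookup is the safer choice.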
LMc answered Dec 08 '25 20:12
