 

Splitting scientific names [closed]

A scientific name usually consists of 3 pieces of information: Genus, species epitheton and Author. A simple example would be the following:

Acanthus ilicifolius L.

  • Genus: Acanthus
  • Species epitheton: ilicifolius
  • Author: L.

Easy. However, the matter gets more complicated when we have to deal with hybrids, subspecies/varieties/forma, several authors and other inconsistencies. In these cases, a species name might look like this:

cf. Andrographis paniculata (Burm.f.) Wall. ex Nees

  • cf.: the species was not determined with 100% certainty
  • Genus: Andrographis
  • Species epitheton: paniculata
  • Author: (Burm.f.) Wall. ex Nees

or this:

Ipomoea pes-caprae (L.) DC. subsp. brasiliensis (L.) Ooststr.f

  • Genus: Ipomoea
  • Species epitheton: pes-caprae
  • Species author: (L.) DC.
  • Subspecies epitheton: brasiliensis
  • Subspecies Author: (L.) Ooststr.f

I'm trying to find a reliable way to deconstruct such names. I could write some hackish code using tons of if/else statements, but I'm looking for something more elegant (and robust). I was thinking of some kind of parser that handles the name much as a calculator parses a mathematical expression. Unfortunately, I'm not the most sophisticated programmer: I have never written a real parser, and I don't know whether one would make sense here, as there is quite a lot of variation in scientific names. What do you think is the best way to tackle this problem? My preferred language is R, or perhaps Julia if it suits the task better.

ChrKoenig asked Jan 21 '15 16:01



1 Answer

You're in luck (kind of). GBIF have a name parser, and the taxize package hooks into its API with the gbif_parse function.

library(taxize)
gbif_parse(c('Acanthus ilicifolius L.', 
             'cf. Andrographis paniculata (Burm.f.) Wall. ex Nees', 
             'Ipomoea pes-caprae (L.) DC. subsp. brasiliensis (L.) Ooststr.f'))

#                                                   scientificname       type genusorabove specificepithet authorsparsed    authorship                   canonicalname                canonicalnamewithmarker                                 canonicalnamecomplete bracketauthorship infraspecificepithet  rankmarker
# 1                                        Acanthus ilicifolius L. WELLFORMED     Acanthus     ilicifolius          TRUE            L.            Acanthus ilicifolius                   Acanthus ilicifolius                               Acanthus ilicifolius L.              <NA>                 <NA>       <NA>
# 2            cf. Andrographis paniculata (Burm.f.) Wall. ex Nees   INFORMAL Andrographis      paniculata          TRUE Wall. ex Nees         Andrographis paniculata                Andrographis paniculata      Andrographis paniculata (Burm. f.) Wall. ex Nees          Burm. f.                 <NA>       <NA>
# 3 Ipomoea pes-caprae (L.) DC. subsp. brasiliensis (L.) Ooststr.f    SCINAME      Ipomoea      pes-caprae          TRUE     Ooststr.f Ipomoea pes-caprae brasiliensis Ipomoea pes-caprae subsp. brasiliensis Ipomoea pes-caprae subsp. brasiliensis (L.) Ooststr.f                L.         brasiliensis     subsp.
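
If only the pieces listed in the question are of interest, the parser output can be narrowed down to the matching columns. The following is a minimal sketch, not part of taxize: `pick_name_parts` is a hypothetical helper, and the `parsed` data frame is a hand-copied stand-in for two rows of the output above so that the example runs without network access. In practice `parsed` would be the result of `gbif_parse()`, whose column names may differ between taxize versions, which is why the helper skips any column it cannot find.

```r
# Hypothetical helper (not part of taxize): keep only the name parts the
# question asks about, skipping any column absent from the parser output.
pick_name_parts <- function(parsed) {
  wanted <- c("scientificname", "genusorabove", "specificepithet",
              "rankmarker", "infraspecificepithet",
              "bracketauthorship", "authorship")
  parsed[, intersect(wanted, names(parsed)), drop = FALSE]
}

# Stand-in for the result of gbif_parse(), with values copied by hand from
# the output above (rows 1 and 3) so the sketch runs offline.
parsed <- data.frame(
  scientificname = c("Acanthus ilicifolius L.",
                     "Ipomoea pes-caprae (L.) DC. subsp. brasiliensis (L.) Ooststr.f"),
  type = c("WELLFORMED", "SCINAME"),
  genusorabove = c("Acanthus", "Ipomoea"),
  specificepithet = c("ilicifolius", "pes-caprae"),
  authorship = c("L.", "Ooststr.f"),
  bracketauthorship = c(NA, "L."),
  infraspecificepithet = c(NA, "brasiliensis"),
  rankmarker = c(NA, "subsp."),
  stringsAsFactors = FALSE
)

parts <- pick_name_parts(parsed)
print(parts)
```

Note that in the output above, the authorship columns for the subspecies name describe the subspecies author ((L.) Ooststr.f); the species author (L.) DC. does not show up in a column of its own, so it would have to be recovered separately if needed.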

See ?gbif_parse for more info. You can also find GBIF on GitHub.

taxize also takes advantage of the EOL API - see ?gni_parse.

jbaums answered Oct 19 '22 20:10
