I am trying to adapt an existing script that uses Biopython to fetch the phylum of a species. The script was written to retrieve information for one species at a time, and I would like to modify it so that it can handle 100 organisms in one run. Here is the initial code:
import sys
from Bio import Entrez

def get_tax_id(species):
    """to get data from ncbi taxonomy, we need to have the taxid. we can
    get that by passing the species name to esearch, which will return
    the tax id"""
    species = species.replace(" ", "+").strip()
    search = Entrez.esearch(term = species, db = "taxonomy", retmode = "xml")
    record = Entrez.read(search)
    return record['IdList'][0]

def get_tax_data(taxid):
    """once we have the taxid, we can fetch the record"""
    search = Entrez.efetch(id = taxid, db = "taxonomy", retmode = "xml")
    return Entrez.read(search)

Entrez.email = ""
if not Entrez.email:
    print "you must add your email address"
    sys.exit(2)

taxid = get_tax_id("Erodium carvifolium")
data = get_tax_data(taxid)
lineage = {d['Rank']:d['ScientificName'] for d in
    data[0]['LineageEx'] if d['Rank'] in ['family', 'order']}
I have managed to modify the script so that it accepts a local file containing one of my organisms, but I need to extend this to 100 organisms.
My idea was to build a list of organisms from the file and feed each item, one at a time, into the line taxid = get_tax_id("Erodium carvifolium"), replacing "Erodium carvifolium" with that organism's name. But I have no idea how to do that.
Here is a sample version of the code with some of my adjustments:
import sys
from Bio import Entrez

def get_tax_id(species):
    """to get data from ncbi taxonomy, we need to have the taxid. we can
    get that by passing the species name to esearch, which will return
    the tax id"""
    species = species.replace(' ', "+").strip()
    search = Entrez.esearch(term = species, db = "taxonomy", retmode = "xml")
    record = Entrez.read(search)
    return record['IdList'][0]

def get_tax_data(taxid):
    """once we have the taxid, we can fetch the record"""
    search = Entrez.efetch(id = taxid, db = "taxonomy", retmode = "xml")
    return Entrez.read(search)

Entrez.email = ""
if not Entrez.email:
    print "you must add your email address"
    sys.exit(2)

list = ['Helicobacter pylori 26695', 'Thermotoga maritima MSB8', 'Deinococcus radiodurans R1', 'Treponema pallidum subsp. pallidum str. Nichols', 'Aquifex aeolicus VF5', 'Archaeoglobus fulgidus DSM 4304']
i = iter(list)
item = i.next()
for item in list:
    ???
    taxid = get_tax_id(?)
    data = get_tax_data(taxid)
    lineage = {d['Rank']:d['ScientificName'] for d in
        data[0]['LineageEx'] if d['Rank'] in ['phylum']}
    print lineage, taxid
The question marks show the places where I am stumped as to what to do next. I don't see how to connect my loop so that it replaces the ? in get_tax_id(?). Or do I need to somehow modify each item in the list so that it reads get_tax_id(Helicobacter pylori 26695), and then find some way to place it in the line containing taxid =?
Here's what you need. Place this below your function definitions, i.e. after the line that says sys.exit(2):
species_list = ['Helicobacter pylori 26695', 'Thermotoga maritima MSB8', 'Deinococcus radiodurans R1', 'Treponema pallidum subsp. pallidum str. Nichols', 'Aquifex aeolicus VF5', 'Archaeoglobus fulgidus DSM 4304']

taxid_list = [] # Initiate the lists to store the data to be parsed in
data_list = []
lineage_list = []

print('parsing taxonomic data...') # message declaring the parser has begun

for species in species_list:
    print('\t' + species) # progress messages

    taxid = get_tax_id(species) # Apply your functions
    data = get_tax_data(taxid)
    lineage = {d['Rank']:d['ScientificName'] for d in data[0]['LineageEx'] if d['Rank'] in ['phylum']}

    taxid_list.append(taxid) # Append the data to lists already initiated
    data_list.append(data)
    lineage_list.append(lineage)

print('complete!')
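
Since the question mentions reading the organisms from a local file, here is a rough sketch of how the same loop might be fed from a file instead of a hard-coded list. The helper names (read_species_file, extract_ranks, lookup_all), the one-name-per-line file layout, and the file name species.txt are all assumptions for illustration, not part of the original script. The lookup functions are passed in as arguments so that the loop can be tried without touching NCBI. The sketch also assumes that a name with no match shows up as an IndexError (from record['IdList'][0] on an empty list), and it pauses between species because NCBI asks for no more than three requests per second without an API key.

```python
import time


def read_species_file(path):
    """Return the non-empty lines of a text file, stripped of whitespace.

    Assumes one species name per line (hypothetical file layout).
    """
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def extract_ranks(data, ranks=("phylum",)):
    """Pick the requested ranks out of a parsed efetch taxonomy record."""
    return {d["Rank"]: d["ScientificName"]
            for d in data[0]["LineageEx"] if d["Rank"] in ranks}


def lookup_all(species_list, get_id, get_data, delay=1.0):
    """Map each species name to its lineage dict, or None if the lookup failed.

    get_id and get_data are expected to behave like get_tax_id and
    get_tax_data from the script above.
    """
    results = {}
    for species in species_list:
        try:
            taxid = get_id(species)
            results[species] = extract_ranks(get_data(taxid))
        except (IndexError, KeyError):
            # Assumed failure modes: empty IdList or a record without LineageEx
            results[species] = None
        time.sleep(delay)  # be polite to the NCBI servers
    return results
```

With the functions from the question already defined, this could be called as results = lookup_all(read_species_file("species.txt"), get_tax_id, get_tax_data). Any species whose value comes back as None would then need to be checked by hand, for example for a misspelled name.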