
Biopython class instance - output from Entrez.read: I don't know how to manipulate the output

I am trying to download some XML from PubMed. No problems there; Biopython is great. The problem is that I do not really know how to manipulate the output. I want to put most of the parsed XML into a SQL database, but I'm not familiar with the output. For some things I can access the parsed XML like a dictionary, but for others it doesn't seem that straightforward.

from Bio import Entrez
import sqlite3 as lite

Entrez.email = "[email protected]"
handle = Entrez.efetch(db='pubmed', id='22737229', retmode='xml')
record = Entrez.read(handle)

If I want to find the title I can do this:

title=record[0]['MedlineCitation']['Article']['ArticleTitle']

But the type of the parsed object is a class:

>>> type(record)
<class 'Bio.Entrez.Parser.ListElement'>
>>> r = record[0]
>>> type(r)
<class 'Bio.Entrez.Parser.DictionaryElement'>
>>> r.keys()
[u'MedlineCitation', u'PubmedData']

Which makes me think there must be a much easier way of doing this than using it as a dictionary. But when I try:

>>> r.MedlineCitation

Traceback (most recent call last):
  File "<pyshell#67>", line 1, in <module>
    r.MedlineCitation
AttributeError: 'DictionaryElement' object has no attribute 'MedlineCitation'

It doesn't work. I can obviously use it as a dictionary, but then I run into problems later.

The real problem is trying to get certain information from the record when using it like a dictionary:

>>> record[0]['MedlineCitation']['PMID']
StringElement('22737229', attributes={u'Version': u'1'})

Which means that I can't just plop (that's a technical term ;) it into my SQL database; I need to convert it first:

>>> t=record[0]['MedlineCitation']['PMID']
>>> t
StringElement('22737229', attributes={u'Version': u'1'})
>>> int(t)
22737229
>>> str(t)
'22737229'

All in all I am glad for the depth of information that Entrez.read() provides but I am not sure how to easily use the information in the resulting class instance. Usually you can just do things like

record.MedlineCitation

but it doesn't work.

Cheers

Wheaton

Wheaton Little asked Jul 04 '12 04:07


2 Answers

The Entrez.read() function returns a nested data structure composed of ListElements and DictionaryElements. For more information, check out the docstring of the read function in the Biopython source, which I'll excerpt and paraphrase below:

def read(handle, validate=True):

This function parses an XML file created by NCBI's Entrez Utilities,
returning a multilevel data structure of Python lists and dictionaries.
...
the[se] data structure[s] seem to consist of generic Python lists,
dictionaries, strings, and so on, [but] each of these is actually a class
derived from the base type. This allows us to store the attributes
(if any) of each element in a dictionary my_element.attributes, and
the tag name in my_element.tag.

The author of the package, Michiel de Hoon, also spends some time at the very top of the Parser.py source file discussing his motivations for representing the XML documents using the custom ListElements and DictionaryElements in Entrez.

If you're super curious you can also read the fascinating declarations of the ListElement, DictionaryElement, and StructureElement classes in the source. I'll spoil the surprise and just let you know that they are very light wrappers around their basic Python datatypes, and behave almost exactly the same as their underlying basic datatypes, except they have a new property, attributes, which captures the XML attributes (keys and values) for each XML node in the document that read is parsing.
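To make the "light wrapper" idea concrete, here is a minimal sketch of what such a class could look like. AnnotatedDict is a hypothetical stand-in I wrote for illustration, not Biopython's actual code; the only point is that a dict subclass carrying an extra attributes property still behaves like a dict.

```python
class AnnotatedDict(dict):
    """Hypothetical stand-in for a DictionaryElement-style class."""

    def __init__(self, *args, attributes=None, **kwargs):
        super().__init__(*args, **kwargs)
        # The XML attributes of the node are stored next to the dict contents.
        self.attributes = dict(attributes or {})


node = AnnotatedDict({"first_name": "Russell"},
                     attributes={"occupation": "entertainer"})

print(isinstance(node, dict))   # the subclass is still a dict
print(node["first_name"])       # normal key lookup
print(sorted(node.keys()))      # normal dict methods
print(node.attributes)          # the extra, XML-specific part
```

Anything that accepts a plain dict should accept an object like this too, which is why the dictionary API is the natural way in.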

So the basic answer to your question is that there is no "easy" way of using dot-operator syntax to address the keys of a DictionaryElement. If you have a dictionary element d, such that:

>>> d
DictElement({'first_name': 'Russell', 'last_name': 'Jones'}, attributes={'occupation': 'entertainer'})

The only built-in way you can read the first_name is by using the normal Python dictionary API, for instance:

>>> d['first_name']
'Russell'
>>> d.get('first_name')
'Russell'
>>> d.get('middle_name', 'No Middle Name')
'No Middle Name'
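If you really want dot-operator access, you could layer it on yourself. The following is only a sketch under my own assumptions: DotView is a hypothetical helper (nothing like it ships with Biopython as far as I know), and the nested plain dict stands in for a parsed record so the example runs without a network call.

```python
class DotView:
    """Hypothetical read-only wrapper giving attribute access to dict keys."""

    def __init__(self, mapping):
        self._mapping = mapping

    def __getattr__(self, name):
        # __getattr__ only runs when normal attribute lookup fails.
        # Refuse private names so a missing _mapping cannot recurse forever.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            value = self._mapping[name]
        except KeyError:
            raise AttributeError(name) from None
        # Wrap nested dictionaries so lookups can be chained.
        return DotView(value) if isinstance(value, dict) else value


# Stand-in data shaped like one parsed article; the title is a placeholder.
record = {"MedlineCitation": {"Article": {"ArticleTitle": "An example title"}}}
view = DotView(record)
print(view.MedlineCitation.Article.ArticleTitle)
```

The trade-off is that keys which are not valid Python identifiers are still only reachable with square brackets, so I'd personally stick with the dictionary API.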

Don't lose heart; this really isn't so bad. If you want to take certain nodes and insert them into rows of a sqlite database, you can write small functions that take a DictionaryElement as input and return something sqlite can accept as output. If you're having trouble with this, feel free to post another question specifically about that.
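As a rough sketch of that kind of small function: citation_to_row below is a hypothetical helper, and the articles table with its pmid and title columns is my own assumption rather than anything from your schema. Plain dicts and strings stand in for the Entrez.read() output so the example runs on its own; with a real record, int() and str() do the conversion you already found.

```python
import sqlite3


def citation_to_row(article):
    """Turn one parsed article (dict-like) into a (pmid, title) tuple."""
    citation = article["MedlineCitation"]
    # int() and str() drop the Entrez wrapper types and their attributes,
    # leaving plain values that sqlite3 can bind directly.
    pmid = int(citation["PMID"])
    title = str(citation["Article"]["ArticleTitle"])
    return pmid, title


# Stand-in for the list returned by Entrez.read(); the title is a placeholder.
records = [
    {"MedlineCitation": {"PMID": "22737229",
                         "Article": {"ArticleTitle": "Placeholder title"}}},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (pmid INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO articles (pmid, title) VALUES (?, ?)",
                 [citation_to_row(r) for r in records])
conn.commit()

rows = conn.execute("SELECT pmid, title FROM articles").fetchall()
print(rows)
```

Keeping the conversion in one function per table means the awkward nested lookups live in a single place, and the database code only ever sees plain tuples.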

prairiedogg answered Nov 10 '22 02:11


I'm not sure if this is right, but I believe that 'record' is a list of dictionaries, so you need to get each dictionary using a loop.

Something like:

for r in record:
    r['MedlineCitation']

Nick answered Nov 10 '22 01:11