I am trying to download some XML from PubMed. No problems there; Biopython is great. The problem is that I do not really know how to manipulate the output. I want to put most of the parsed XML into a SQL database, but I'm not familiar with the output. For some things I can treat the parsed XML like a dictionary, but for others it doesn't seem that straightforward.
from Bio import Entrez
Entrez.email="[email protected]"
import sqlite3 as lite
handle=Entrez.efetch(db='pubmed',id='22737229', retmode='xml')
record = Entrez.read(handle)
If I want to find the title I can do this:
title=record[0]['MedlineCitation']['Article']['ArticleTitle']
But the type of the parsed object is a class:
>>> type(record)
<class 'Bio.Entrez.Parser.ListElement'>
>>> r = record[0]
>>> type(r)
<class 'Bio.Entrez.Parser.DictionaryElement'>
>>> r.keys()
[u'MedlineCitation', u'PubmedData']
Which makes me think there must be a much easier way of doing this than using it as a dictionary. But when I try:
>>> r.MedlineCitation
Traceback (most recent call last):
File "<pyshell#67>", line 1, in <module>
r.MedlineCitation
AttributeError: 'DictionaryElement' object has no attribute 'MedlineCitation'
It doesn't work. I can obviously use it as a dictionary, but then I run into problems later.
The real problem is trying to get certain information from the record when using it like a dictionary:
>>> record[0]['MedlineCitation']['PMID']
StringElement('22737229', attributes={u'Version': u'1'})
Which means that I can't just plop (that's a technical term ;) it into my sql database but need to convert it:
>>> t=record[0]['MedlineCitation']['PMID']
>>> t
StringElement('22737229', attributes={u'Version': u'1'})
>>> int(t)
22737229
>>> str(t)
'22737229'
All in all I am glad for the depth of information that Entrez.read() provides but I am not sure how to easily use the information in the resulting class instance. Usually you can just do things like
record.MedlineCitation
but it doesn't work.
Cheers
Wheaton
The Entrez.read() method is going to return you a nested data structure, composed of ListElements and DictionaryElements. For more information, check out the documentation of the read method in the Biopython source, which I'll excerpt and paraphrase below:
def read(handle, validate=True):
This function parses an XML file created by NCBI's Entrez Utilities,
returning a multilevel data structure of Python lists and dictionaries.
...
the[se] data structure[s] seem to consist of generic Python lists,
dictionaries, strings, and so on, [but] each of these is actually a class
derived from the base type. This allows us to store the attributes
(if any) of each element in a dictionary my_element.attributes, and
the tag name in my_element.tag.
The author of the package, Michiel de Hoon, also spends some time at the very top of the Parser.py source file discussing his motivations for representing the XML documents using the custom ListElements and DictionaryElements in Entrez.
If you're super curious you can also read the fascinating declarations of the ListElement, DictionaryElement, and StructureElement classes in the source. I'll spoil the surprise and just let you know that they are very light wrappers around their basic Python datatypes, and behave almost exactly the same as their underlying basic datatypes, except they have a new property, attributes, which captures the XML attributes (keys and values) for each XML node in the document that read is parsing.
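To make the "light wrapper" idea concrete, here is a minimal stand-in, not Biopython's actual code. The class name TaggedString is hypothetical; it only mimics the behaviour of the PMID value shown in the question: a str subclass that carries an attributes dictionary alongside its text.

```python
class TaggedString(str):
    """Hypothetical stand-in for a parsed XML text node (not Biopython's class)."""

    def __new__(cls, value, attributes=None):
        # str is immutable, so the text is fixed in __new__; the XML
        # attributes are then attached to the new instance as a plain dict.
        obj = super().__new__(cls, value)
        obj.attributes = dict(attributes or {})
        return obj


pmid = TaggedString("22737229", attributes={"Version": "1"})

is_plain_string = isinstance(pmid, str)  # it is still a str underneath
pmid_as_int = int(pmid)                  # converts the same way a str does
version = pmid.attributes["Version"]     # XML attributes are kept separately
```

Because the object is still a str, ordinary string operations and conversions such as int() and str() apply to it unchanged.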
So the basic answer to your question is that there is no "easy" way of using dot-operator syntax to address the keys of a DictionaryElement. If you have a dictionary element d, such that:
>>> d
DictElement({'first_name': 'Russell', 'last_name': 'Jones'}, attributes={'occupation': 'entertainer'})
The only built-in way you can read the first_name is by using the normal Python dictionary API, for instance:
>>> d['first_name']
'Russell'
>>> d.get('first_name')
'Russell'
>>> d.get('middle_name', 'No Middle Name')
'No Middle Name'
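If dot syntax really matters to you, one option is a thin wrapper of your own. The sketch below is not part of Biopython: AttrView is a hypothetical helper, and the sample dictionary is made-up data shaped like the record in the question. It assumes the nested values are dict subclasses, which the docstring excerpt above suggests is the case for parsed Entrez records.

```python
class AttrView:
    """Hypothetical helper: read the keys of a dict-like object with dot syntax."""

    def __init__(self, mapping):
        self._mapping = mapping

    def __getattr__(self, name):
        # __getattr__ only runs when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            value = self._mapping[name]
        except KeyError:
            raise AttributeError(name) from None
        # Wrap nested dictionaries so chained access keeps working.
        if isinstance(value, dict):
            return AttrView(value)
        return value


# Made-up data with the same shape as the question's record[0].
record = {
    "MedlineCitation": {
        "PMID": "22737229",
        "Article": {"ArticleTitle": "An example title"},
    }
}

view = AttrView(record)
title = view.MedlineCitation.Article.ArticleTitle
```

Note the limits of this approach: keys that are not valid Python identifiers still need bracket access, and lists are returned as-is rather than wrapped.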
Don't lose heart, this really isn't so bad. If you want to take certain nodes and insert them into rows of a sqlite database, you can just write small functions that take a DictionaryElement as input and return something sqlite can accept as output. If you're having trouble with this, feel free to post another question specifically about that.
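As a rough sketch of what such a function might look like: citation_to_row, the articles table, and its columns are all hypothetical names, and the records list is stand-in data shaped like the question's output rather than a real Entrez.read() result. The explicit int() and str() calls are the same conversions shown in the question.

```python
import sqlite3


def citation_to_row(article):
    """Flatten one parsed article (dict-like) into a (pmid, title) tuple."""
    citation = article["MedlineCitation"]
    # int() and str() turn parser-specific subclasses into plain built-ins.
    pmid = int(citation["PMID"])
    title = str(citation["Article"]["ArticleTitle"])
    return pmid, title


# Stand-in data; real input would be the list returned by Entrez.read().
records = [
    {
        "MedlineCitation": {
            "PMID": "22737229",
            "Article": {"ArticleTitle": "An example title"},
        }
    },
]

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute("CREATE TABLE articles (pmid INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO articles (pmid, title) VALUES (?, ?)",
    (citation_to_row(article) for article in records),
)
conn.commit()
rows = conn.execute("SELECT pmid, title FROM articles").fetchall()
conn.close()
```

Keeping the conversion in one small function per table means the database code never has to know about the parser's element classes.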
I'm not sure if this is right, but I believe that 'record' is a list of dictionaries, so you need to get each dictionary using a loop.
Something like:
for r in record:
    r['MedlineCitation']