
Filtering/accessing dates in Bio.Entrez PubMed pulls with Python

I have a list of criteria (author names and the date ranges in which their papers were published) that I want to use to obtain a list of published papers. I'm using Biopython's Bio.Entrez to pull the papers from Entrez.

I can query and get results by author name, but I can't figure out how to work with the returned data to filter by date. This is what I've done:

from Bio import Entrez

handle = Entrez.esearch(db="pubmed", term="")
result = Entrez.read(handle)
handle.close()
ids = result['IdList']
print(ids)
# for each ID, pull the summary
for uid in ids:
    handle2 = Entrez.esummary(db="pubmed", id=uid, retmode="xml")
    result2 = Entrez.read(handle2)
    handle2.close()

The output now looks like this:

 [{'DOI': '10.1016/j.jmoldx.2013.10.002', 'Title': 'Validation of a next-generation sequencing assay for clinical molecular oncology.', 'Source': 'J Mol Diagn', 'PmcRefCount': 7, 'Issue': '1', 'SO': '2014 Jan;16(1):89-105', 'ISSN': '1525-1578', 'Volume': '16', 'FullJournalName': 'The Journal of molecular diagnostics : JMD', 'RecordStatus': 'PubMed - indexed for MEDLINE', 'ESSN': '1943-7811', 'ELocationID': 'doi: 10.1016/j.jmoldx.2013.10.002', 'Pages': '89-105', 'PubStatus': 'ppublish+epublish', 'AuthorList': ['Cottrell CE', 'Al-Kateb H', 'Bredemeyer AJ', 'Duncavage EJ', 'Spencer DH', 'Abel HJ', 'Lockwood CM', 'Hagemann IS', "O'Guin SM", 'Burcea LC', 'Sawyer CS', 'Oschwald DM', 'Stratman JL', 'Sher DA', 'Johnson MR', 'Brown JT', 'Cliften PF', 'George B', 'McIntosh LD', 'Shrivastava S', 'Nguyen TT', 'Payton JE', 'Watson MA', 'Crosby SD', 'Head RD', 'Mitra RD', 'Nagarajan R', 'Kulkarni S', 'Seibert K', 'Virgin HW 4th', 'Milbrandt J', 'Pfeifer JD'], 'EPubDate': '2013 Nov 6', 'PubDate': '2014 Jan', 'NlmUniqueID': '100893612', 'LastAuthor': 'Pfeifer JD', 'ArticleIds': {'pii': 'S1525-1578(13)00219-5', 'medline': [], 'pubmed': ['24211365'], 'eid': '24211365', 'rid': '24211365', 'doi': '10.1016/j.jmoldx.2013.10.002'}, u'Item': [], 'History': {'received': '2013/02/04 00:00', 'medline': ['2014/08/30 06:00'], 'revised': '2013/08/23 00:00', 'pubmed': ['2013/11/12 06:00'], 'aheadofprint': '2013/11/06 00:00', 'accepted': '2013/10/01 00:00', 'entrez': '2013/11/12 06:00'}, 'LangList': ['English'], 'HasAbstract': 1, 'References': ['J Mol Diagn. 2014 Jan;16(1):7-10. PMID: 24269227'], 'PubTypeList': ['Journal Article'], u'Id': '24211365'}]

I looked at using EFetch, which, from what I understand, doesn't always return XML. I thought I could filter by date by parsing the XML like so:

proj_start = '2009 Jan 01'
proj_start = time.strptime(proj_start, '%Y %b %d')
for paper in results2:
    handle = open(paper)
    record = Entrez.read(handle)
    pub_dat=time.strptime(record["EPubDate"], '%Y %b %d')  

I get this error:

Traceback (most recent call last):
  File "<ipython-input-39-13bcded12392>", line 2, in <module>
    handle = open(paper)
TypeError: coercing to Unicode: need string or buffer, ListElement found

I feel like I'm missing something, and that I should be able to feed the date range directly into the query. I also don't understand why this approach doesn't work, even though it seems like the harder way to do it. Is there a better way? I also tried xml.etree, but got a similar error.

Jacob Ian asked May 21 '26 02:05


1 Answer

You don't need to call open(paper): paper is already a Python dict (basically JSON), not a file path. If you want the accepted date, you can access it like this:

paper['History']['accepted']
'2013/10/01 00:00'
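
To go from that string to the date filter the question asks about, something like the following might work. It is only a sketch: the sample record is trimmed from the esummary output in the question, the helper names parse_entrez_date and published_on_or_after are made up for illustration, and the list of date formats covers only what appears in that one record, so other records may need more.

```python
from datetime import datetime

# Sample record trimmed from the esummary output shown in the question.
paper = {
    'Title': 'Validation of a next-generation sequencing assay for clinical molecular oncology.',
    'EPubDate': '2013 Nov 6',
    'PubDate': '2014 Jan',
    'History': {'accepted': '2013/10/01 00:00'},
}

# Formats seen in the sample output; other records may need more (assumption).
DATE_FORMATS = ('%Y/%m/%d %H:%M', '%Y %b %d', '%Y %b', '%Y')


def parse_entrez_date(text):
    """Return a datetime for the first format that matches, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


def published_on_or_after(record, start, field='EPubDate'):
    """True if the record's date field parses and is not before start."""
    parsed = parse_entrez_date(record.get(field, ''))
    return parsed is not None and parsed >= start


proj_start = datetime.strptime('2009 Jan 01', '%Y %b %d')
accepted = parse_entrez_date(paper['History']['accepted'])
keep = published_on_or_after(paper, proj_start)
print(accepted, keep)
```

Trying several formats matters because fields like PubDate can omit the day ('2014 Jan'), which a single '%Y %b %d' pattern would reject. As for feeding the range directly into the query: the NCBI ESearch utility documents mindate, maxdate and datetype parameters, and Entrez.esearch passes extra keyword arguments through to it, so restricting by date at search time may avoid the post-filtering altogether.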
maxymoo answered May 22 '26 15:05