Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Downloading all full-text articles in PMC and PubMed databases

According to one of the answered questions by NCBI Help Desk , we cannot "bulk-download" PubMed Central. However, can I use "NCBI E-utilities" to download all full-text papers in PMC database using Efetch or at least find all corresponding PMCids using Esearch in Entrez Programming Utilities? If yes, then how? If E-utilities cannot be used, is there any other way to download all full-text articles?

like image 372
s.22217330 Avatar asked Nov 14 '17 19:11

s.22217330


1 Answers

First of all, before you go downloading files in bulk, I highly recommend you read the E-utilities usage guidelines.

If you want full-text articles, you're going to want to limit your search to open access files. Furthermore, I suggest also restricting your search to Medline articles if you want articles that are any good. Then you can do the search.

Using Biopython, this gives us :

search_query = 'medline[sb] AND "open access"[filter]'

# getting search results for the query
search_results = Entrez.read(Entrez.esearch(db="pmc", term=search_query, retmax=10, usehistory="y"))

You can use the search function on the PMC website and it will display the generated query that you can copy/paste into your code. Now that you've done the search, you can actually download the files :

handle = Entrez.efetch(db="pmc", rettype="full", retmode="xml", retstart=0, retmax=int(search_results["Count"]), webenv=search_results["WebEnv"], query_key=search_results["QueryKey"])
  • You might want to download in batches by changing retstart and retmax by variables in a loop in order to avoid flooding the servers.
  • If handle contains only one file, handle.read() contains the whole XML file as a string. If it contains more, the articles are contained in <article></article> nodes.
  • The full text is only available in XML, and the default parser available in pubmed doesn't handle XML namespaces, so you're going to be on your own with ElementTree (or an other parser) to parse your XML.
  • Here, the articles are found thanks to the internal history of E-utilities, which is accessed with the webenv argument and enabled thanks to the usehistory="y" argument in Entrez.read()

A few tips about XML parsing with ElementTree : You can't delete a grandchild node, so you're probably going to want to delete some nodes recursively. node.text returns the text in node, but only up to the first child, so you'll need to do something along the lines of "".join(node.itertext()) if you want to get all the text in a given node.

like image 97
Syncrossus Avatar answered Oct 01 '22 07:10

Syncrossus