How do I download all the abstract data from PubMed (NCBI)?

Tags:

pubmed

ncbi

I want to download all of the PubMed article abstracts. Does anyone know an easy way to download them in bulk?

I found a source for the data: ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/af/12/

Is there any way to download all of these tar files?

Thanks in advance.
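A minimal sketch of one way to bulk-download that FTP directory, using Python's standard-library ftplib (assuming anonymous FTP access works and that the listing contains the archives directly; adjust the filename filter if other file types are present):

    import os
    from ftplib import FTP

    # Connect anonymously to the NCBI FTP server and move to the
    # directory from the URL in the question.
    ftp = FTP("ftp.ncbi.nlm.nih.gov")
    ftp.login()  # anonymous login
    ftp.cwd("pub/pmc/af/12")

    # Download every tar archive in the directory listing.
    for name in ftp.nlst():
        if name.endswith(".tar.gz"):
            print("Downloading " + name)
            with open(name, "wb") as f:
                ftp.retrbinary("RETR " + name, f.write)

    ftp.quit()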

asked Nov 04 '15 by Soundarya Thiagarajan

2 Answers

There is an R package called rentrez (https://ropensci.org/packages/) worth checking out. You can retrieve abstracts by specific keywords, PMID, etc. I hope it helps.

UPDATE: You can download all the abstracts by passing your list of IDs to the following code.

    library(rentrez)
    library(XML)

    # PMIDs of the articles whose abstracts we want.
    your.ids <- c("26386083", "26273372", "26066373", "25837167",
                  "25466451", "25013473")

    # rentrez function to fetch the records from the PubMed db as parsed XML.
    fetch.pubmed <- entrez_fetch(db = "pubmed", id = your.ids,
                                 rettype = "xml", parsed = TRUE)

    # Extract the abstract text for each article node.
    abstracts <- xpathApply(fetch.pubmed, "//PubmedArticle//Article", function(x)
                            xmlValue(xmlChildren(x)$Abstract))

    # Name the abstracts with their PMIDs.
    names(abstracts) <- your.ids
    abstracts

    # Collapse into a data frame and write out as CSV.
    col.abstracts <- do.call(rbind.data.frame, abstracts)
    dim(col.abstracts)
    write.csv(col.abstracts, file = "test.csv")
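If you are not an R user, the same fetch can be made directly against the E-utilities efetch endpoint; a minimal Python sketch using the same PMIDs as above (this is my addition, not part of the answer's code, and it needs the requests package installed):

    import requests

    # Same PMIDs as in the R example above.
    ids = ["26386083", "26273372", "26066373", "25837167",
           "25466451", "25013473"]

    # efetch can return plain-text abstracts directly.
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    params = {
        "db": "pubmed",
        "id": ",".join(ids),
        "rettype": "abstract",
        "retmode": "text",
    }
    r = requests.get(url, params=params)
    r.raise_for_status()

    with open("abstracts.txt", "w") as f:
        f.write(r.text)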
answered Sep 19 '22 by user5249203


I appreciate that this is a somewhat old question.

If you wish to get all of the PubMed entries with Python, here is a script I wrote a while ago:

    import requests

    # One esearch call with a date range wide enough to cover all of PubMed.
    # usehistory=y stores the result set on the server so efetch can page
    # through it via WebEnv instead of explicit ID lists.
    search_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&mindate=1800/01/01&maxdate=2016/12/31&usehistory=y&retmode=json"
    search_r = requests.post(search_url)
    search_data = search_r.json()
    webenv = search_data["esearchresult"]["webenv"]
    total_records = int(search_data["esearchresult"]["count"])

    # efetch returns PubMed XML by default; retmax=10000 is the per-request cap.
    fetch_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmax=10000&query_key=1&webenv=" + webenv

    for i in range(0, total_records, 10000):
        this_fetch = fetch_url + "&retstart=" + str(i)
        print("Getting this URL: " + this_fetch)
        fetch_r = requests.post(this_fetch)
        f = open("pubmed_batch_" + str(i) + "_to_" + str(i + 9999) + ".xml", "w")
        f.write(fetch_r.text)
        f.close()

    print("Number of records found: " + str(total_records))

It starts off by making an entrez/eutils search request between two dates chosen to be wide enough to capture all of PubMed. From that response, the 'webenv' (which saves the search history server-side) and total_records are retrieved. Using the webenv capability saves having to hand the individual record IDs to the efetch call.

Fetching records (efetch) can only be done in batches of 10,000, so the for loop grabs 10,000 records per request and saves them in labelled files until all the records are retrieved.

Note that requests can fail (non-200 HTTP responses, network errors), so in a more robust solution you should wrap each requests.post() in a try/except, and before writing the data to file you should ensure that the HTTP response has a 200 status.
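A minimal sketch of that hardening (the helper name fetch_one_batch and the retry count are mine, not part of the script above):

    import requests

    def fetch_one_batch(url, retries=3):
        """POST to an E-utilities URL, retrying on errors; return the body text."""
        for attempt in range(retries):
            try:
                r = requests.post(url, timeout=60)
                if r.status_code == 200:
                    return r.text
                print("Got HTTP " + str(r.status_code) + ", retrying...")
            except requests.RequestException as e:
                print("Request failed (" + str(e) + "), retrying...")
        raise RuntimeError("Giving up on " + url)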

answered Sep 17 '22 by DanB