I've written a script in Python to collect the links leading to different articles from a webpage. When I run it, I get the links flawlessly. However, the problem I'm facing is that the article links span multiple pages, as there are too many of them to fit on a single page. Clicking the next page button fires an AJAX call through a POST request; I can see the attached information in the developer tools. As there is no link attached to that next page button, I can't find any way to go on to the next page and parse links from there. I've tried a POST request with that form data, but it doesn't seem to work. Where am I going wrong?
Link to the landing page containing articles
This is the information I get using the Chrome dev tools when I click on the next page button:
GENERAL
=======================================================
Request URL: https://www.ncbi.nlm.nih.gov/pubmed/
Request Method: POST
Status Code: 200 OK
Remote Address: 130.14.29.110:443
Referrer Policy: origin-when-cross-origin
RESPONSE HEADERS
=======================================================
Cache-Control: private
Connection: Keep-Alive
Content-Encoding: gzip
Content-Security-Policy: upgrade-insecure-requests
Content-Type: text/html; charset=UTF-8
Date: Fri, 29 Jun 2018 10:27:42 GMT
Keep-Alive: timeout=1, max=9
NCBI-PHID: 396E3400B36089610000000000C6005E.m_12.03.m_8
NCBI-SID: CE8C479DB3510951_0083SID
Referrer-Policy: origin-when-cross-origin
Server: Apache
Set-Cookie: ncbi_sid=CE8C479DB3510951_0083SID; domain=.nih.gov; path=/; expires=Sat, 29 Jun 2019 10:27:42 GMT
Set-Cookie: WebEnv=1Jqk9ZOlyZSMGjHikFxNDsJ_ObuK0OxHkidgMrx8vWy2g9zqu8wopb8_D9qXGsLJQ9mdylAaDMA_T-tvHJ40Sq_FODOo33__T-tAH%40CE8C479DB3510951_0083SID; domain=.nlm.nih.gov; path=/; expires=Fri, 29 Jun 2018 18:27:42 GMT
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-UA-Compatible: IE=Edge
X-XSS-Protection: 1; mode=block
REQUEST HEADERS
========================================================
Accept: text/html, */*; q=0.01
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Connection: keep-alive
Content-Length: 395
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Cookie: ncbi_sid=CE8C479DB3510951_0083SID; _ga=GA1.2.1222765292.1530204312; _gid=GA1.2.739858891.1530204312; _gat=1; WebEnv=18Kcapkr72VVldfGaODQIbB2bzuU50uUwU7wrUi-x-bNDgwH73vW0M9dVXA_JOyukBSscTE8Qmd1BmLAi2nDUz7DRBZpKj1wuA_QB%40CE8C479DB3510951_0083SID; starnext=MYGwlsDWB2CmAeAXAXAbgA4CdYDcDOsAhpsABZoCu0IA9oQCZxLJA===
Host: www.ncbi.nlm.nih.gov
NCBI-PHID: 396E3400B36089610000000000C6005E.m_12.03
Origin: https://www.ncbi.nlm.nih.gov
Referer: https://www.ncbi.nlm.nih.gov/pubmed
User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36
X-Requested-With: XMLHttpRequest
FORM DATA
========================================================
p$l: AjaxServer
portlets: id=relevancesortad:sort=;id=timelinead:blobid=NCID_1_120519284_130.14.22.215_9001_1530267709_1070655576_0MetA0_S_MegaStore_F_1:yr=:term=%222015%22%5BDate%20-%20Publication%5D%20%3A%20%223000%22%5BDate%20-%20Publication%5D;id=reldata:db=pubmed:querykey=1;id=searchdetails;id=recentactivity
load: yes
This is my script so far (the GET request, if uncommented, works flawlessly, but only for the first page):
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

geturl = "https://www.ncbi.nlm.nih.gov/pubmed/?term=%222015%22%5BDate+-+Publication%5D+%3A+%223000%22%5BDate+-+Publication%5D"
posturl = "https://www.ncbi.nlm.nih.gov/pubmed/"

# res = requests.get(geturl, headers={"User-Agent": "Mozilla/5.0"})
# soup = BeautifulSoup(res.text, "lxml")
# for items in soup.select("div.rslt p.title a"):
#     print(items.get("href"))

FormData = {
    'p$l': 'AjaxServer',
    'portlets': 'id=relevancesortad:sort=;id=timelinead:blobid=NCID_1_120519284_130.14.22.215_9001_1530267709_1070655576_0MetA0_S_MegaStore_F_1:yr=:term=%222015%22%5BDate%20-%20Publication%5D%20%3A%20%223000%22%5BDate%20-%20Publication%5D;id=reldata:db=pubmed:querykey=1;id=searchdetails;id=recentactivity',
    'load': 'yes'
}

req = requests.post(posturl, data=FormData, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(req.text, "lxml")
for items in soup.select("div.rslt p.title a"):
    print(items.get("href"))
By the way, the URL in the browser becomes "https://www.ncbi.nlm.nih.gov/pubmed" when I click on the next page link.
I don't wish to go for any solution based on a browser simulator. Thanks in advance.
The content is heavily dynamic, so it would be best to use selenium or similar clients, but I realize that this wouldn't be practical since the number of results is so large. So, we'll have to analyse the HTTP requests submitted by the browser and simulate them with requests.
The contents of the next page are loaded by a POST request to /pubmed, and the POST data are the input fields of the EntrezForm form. The form submission is controlled by js (triggered when the 'next page' button is clicked) and is performed with the .submit() method.
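If you want to inspect those fields yourself, you can dump every named input of the EntrezForm form from the landing page (a minimal sketch using the same URL and selectors as the scripts below; the value truncation to 60 characters is just for readability):

import requests
from bs4 import BeautifulSoup

url = "https://www.ncbi.nlm.nih.gov/pubmed/?term=%222015%22%5BDate+-+Publication%5D+%3A+%223000%22%5BDate+-+Publication%5D"
soup = BeautifulSoup(requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text, "lxml")

# List every named input the EntrezForm form would submit on a page change.
for field in soup.select('form#EntrezForm input[name]'):
    print(field['name'], '=', field.get('value', '')[:60])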
After some examination I discovered some interesting fields:

EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.CurrPage and EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.cPage indicate the current and the next page.

EntrezSystem2.PEntrez.DbConnector.Cmd seems to perform a database query. If we don't submit this field, the results won't change.

EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.PageSize and EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.PrevPageSize indicate the number of results per page.
With that information I was able to get multiple pages with the script below.
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
geturl = "https://www.ncbi.nlm.nih.gov/pubmed/?term=%222015%22%5BDate+-+Publication%5D+%3A+%223000%22%5BDate+-+Publication%5D"
posturl = "https://www.ncbi.nlm.nih.gov/pubmed/"
s = requests.session()
s.headers["User-Agent"] = "Mozilla/5.0"
soup = BeautifulSoup(s.get(geturl).text,"lxml")
inputs = {i['name']: i.get('value', '') for i in soup.select('form#EntrezForm input[name]')}
results = int(inputs['EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_ResultsController.ResultCount'])
items_per_page = 100
pages = results // items_per_page + int(bool(results % items_per_page))
inputs['EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.PageSize'] = items_per_page
inputs['EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.PrevPageSize'] = items_per_page
inputs['EntrezSystem2.PEntrez.DbConnector.Cmd'] = 'PageChanged'
links = []
for page in range(pages):
    inputs['EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.CurrPage'] = page + 1
    inputs['EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.cPage'] = page

    res = s.post(posturl, inputs)
    soup = BeautifulSoup(res.text, "lxml")

    items = [i['href'] for i in soup.select("div.rslt p.title a[href]")]
    links += items

    for i in items:
        print(i)
I'm requesting 100 items per page because higher numbers seem to 'break' the server, but you should be able to adjust that number with some error checking.
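One way to add that error checking is to retry a page a few times before giving up, since a 'broken' response simply comes back without result links (a sketch; fetch_page, the retry count and the backoff are my own assumptions, not tested values):

import time
from bs4 import BeautifulSoup

def fetch_page(session, posturl, inputs, retries=3):
    # POST the current form state; an empty result list is treated as a
    # server hiccup and retried with exponential backoff.
    for attempt in range(retries):
        res = session.post(posturl, inputs)
        soup = BeautifulSoup(res.text, "lxml")
        items = [i['href'] for i in soup.select("div.rslt p.title a[href]")]
        if items:
            return items
        time.sleep(2 ** attempt)
    return []  # give up on this page after all retries fail

The loop in the script above could then call fetch_page(s, posturl, inputs) instead of posting directly.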
Finally, the links are displayed in descending order (/29960282, /29960281, ...), so I thought we could calculate the links without performing any POST requests:
geturl = "https://www.ncbi.nlm.nih.gov/pubmed/?term=%222015%22%5BDate+-+Publication%5D+%3A+%223000%22%5BDate+-+Publication%5D"
posturl = "https://www.ncbi.nlm.nih.gov/pubmed/"
s = requests.session()
s.headers["User-Agent"] = "Mozilla/5.0"
soup = BeautifulSoup(s.get(geturl).text,"lxml")
results = int(soup.select_one('[name$=ResultCount]')['value'])
first_link = int(soup.select_one("div.rslt p.title a[href]")['href'].split('/')[-1])
last_link = first_link - results
links = [posturl + str(i) for i in range(first_link, last_link, -1)]
But unfortunately the results are not accurate.
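To see how far off the shortcut is, you could diff the computed list against the links actually scraped from the first page (a sketch reusing soup and links from the snippet above; it compares only the numeric IDs, mirroring the split('/') trick used to get first_link):

# Compare the IDs scraped from the first page with the computed ones.
scraped_ids = [a['href'].rstrip('/').split('/')[-1] for a in soup.select("div.rslt p.title a[href]")]
computed_ids = [link.rstrip('/').split('/')[-1] for link in links[:len(scraped_ids)]]
mismatches = sum(1 for s, c in zip(scraped_ids, computed_ids) if s != c)
print(mismatches, "of the first", len(scraped_ids), "IDs differ")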