Can't go on to the next page using post request

I've written a script in Python to collect the links to the individual articles on a webpage. When I run it, it fetches them flawlessly. The problem I'm facing is that the results span multiple pages, as there are too many to fit on one. When I click the next-page button, I can see the information attached below in the developer tools, which in reality produces an AJAX call through a POST request. Since there is no link attached to that next-page button, I can't find any way to move on to the next page and parse links from there. I've tried a POST request with that form data, but it doesn't seem to work. Where am I going wrong?

Link to the landing page containing articles

This is the information I get using chrome dev tools when I click on the next page button:

GENERAL
=======================================================
Request URL: https://www.ncbi.nlm.nih.gov/pubmed/
Request Method: POST
Status Code: 200 OK
Remote Address: 130.14.29.110:443
Referrer Policy: origin-when-cross-origin

RESPONSE HEADERS
=======================================================
Cache-Control: private
Connection: Keep-Alive
Content-Encoding: gzip
Content-Security-Policy: upgrade-insecure-requests
Content-Type: text/html; charset=UTF-8
Date: Fri, 29 Jun 2018 10:27:42 GMT
Keep-Alive: timeout=1, max=9
NCBI-PHID: 396E3400B36089610000000000C6005E.m_12.03.m_8
NCBI-SID: CE8C479DB3510951_0083SID
Referrer-Policy: origin-when-cross-origin
Server: Apache
Set-Cookie: ncbi_sid=CE8C479DB3510951_0083SID; domain=.nih.gov; path=/; expires=Sat, 29 Jun 2019 10:27:42 GMT
Set-Cookie: WebEnv=1Jqk9ZOlyZSMGjHikFxNDsJ_ObuK0OxHkidgMrx8vWy2g9zqu8wopb8_D9qXGsLJQ9mdylAaDMA_T-tvHJ40Sq_FODOo33__T-tAH%40CE8C479DB3510951_0083SID; domain=.nlm.nih.gov; path=/; expires=Fri, 29 Jun 2018 18:27:42 GMT
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Transfer-Encoding: chunked
Vary: Accept-Encoding
X-UA-Compatible: IE=Edge
X-XSS-Protection: 1; mode=block

REQUEST HEADERS
========================================================
Accept: text/html, */*; q=0.01
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Connection: keep-alive
Content-Length: 395
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Cookie: ncbi_sid=CE8C479DB3510951_0083SID; _ga=GA1.2.1222765292.1530204312; _gid=GA1.2.739858891.1530204312; _gat=1; WebEnv=18Kcapkr72VVldfGaODQIbB2bzuU50uUwU7wrUi-x-bNDgwH73vW0M9dVXA_JOyukBSscTE8Qmd1BmLAi2nDUz7DRBZpKj1wuA_QB%40CE8C479DB3510951_0083SID; starnext=MYGwlsDWB2CmAeAXAXAbgA4CdYDcDOsAhpsABZoCu0IA9oQCZxLJA===
Host: www.ncbi.nlm.nih.gov
NCBI-PHID: 396E3400B36089610000000000C6005E.m_12.03
Origin: https://www.ncbi.nlm.nih.gov
Referer: https://www.ncbi.nlm.nih.gov/pubmed
User-Agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36
X-Requested-With: XMLHttpRequest

FORM DATA
========================================================
p$l: AjaxServer
portlets: id=relevancesortad:sort=;id=timelinead:blobid=NCID_1_120519284_130.14.22.215_9001_1530267709_1070655576_0MetA0_S_MegaStore_F_1:yr=:term=%222015%22%5BDate%20-%20Publication%5D%20%3A%20%223000%22%5BDate%20-%20Publication%5D;id=reldata:db=pubmed:querykey=1;id=searchdetails;id=recentactivity
load: yes

This is my script so far (the GET request, if uncommented, works flawlessly, but only for the first page):

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

geturl = "https://www.ncbi.nlm.nih.gov/pubmed/?term=%222015%22%5BDate+-+Publication%5D+%3A+%223000%22%5BDate+-+Publication%5D"
posturl = "https://www.ncbi.nlm.nih.gov/pubmed/"

# res = requests.get(geturl,headers={"User-Agent":"Mozilla/5.0"})
# soup = BeautifulSoup(res.text,"lxml")
# for items in soup.select("div.rslt p.title a"):
#     print(items.get("href"))

FormData={
    'p$l': 'AjaxServer',
    'portlets': 'id=relevancesortad:sort=;id=timelinead:blobid=NCID_1_120519284_130.14.22.215_9001_1530267709_1070655576_0MetA0_S_MegaStore_F_1:yr=:term=%222015%22%5BDate%20-%20Publication%5D%20%3A%20%223000%22%5BDate%20-%20Publication%5D;id=reldata:db=pubmed:querykey=1;id=searchdetails;id=recentactivity',
    'load': 'yes'
    }

req = requests.post(posturl,data=FormData,headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(req.text,"lxml")
for items in soup.select("div.rslt p.title a"):
    print(items.get("href"))

Btw, the url in the browser becomes "https://www.ncbi.nlm.nih.gov/pubmed" when I click on the next page link.

I'd rather not use any browser-simulator solution. Thanks in advance.

asked Jun 29 '18 by SIM

1 Answer

The content is heavily dynamic, so it would be best to use selenium or a similar client. But I realize that this wouldn't be practical here, as the number of results is so large. So, we'll have to analyse the HTTP requests the browser submits and simulate them with requests.

The contents of the next page are loaded by a POST request to /pubmed, and the POST data are the input fields of the EntrezForm form. The form submission is controlled by js (triggered when the 'next page' button is clicked) and is performed with the .submit() method.
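
Collecting those input fields with BeautifulSoup can be sketched like this; the HTML snippet is a made-up stand-in for the real PubMed page, but the selector matches the actual form id:

```python
from bs4 import BeautifulSoup

# Stand-in markup; the real page has many more inputs under form#EntrezForm
html = """
<form id="EntrezForm">
  <input name="EntrezSystem2.PEntrez.DbConnector.Db" value="pubmed">
  <input name="EntrezSystem2.PEntrez.DbConnector.Cmd">
</form>
"""

soup = BeautifulSoup(html, "html.parser")
# Map each named input to its value, defaulting to '' when no value is set
inputs = {i['name']: i.get('value', '') for i in soup.select('form#EntrezForm input[name]')}
print(inputs)
```

The resulting dict can be posted back as-is after tweaking the fields we care about, which is exactly what the full script below does.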

After some examination I discovered some interesting fields:

  • EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.CurrPage and
    EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.cPage indicate the current and next page.

  • EntrezSystem2.PEntrez.DbConnector.Cmd seems to perform a database query. If we don't submit this field, the results won't change.

  • EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.PageSize and EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.PrevPageSize indicate the number of results per page.

With that information I was able to get multiple pages with the script below.

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

geturl = "https://www.ncbi.nlm.nih.gov/pubmed/?term=%222015%22%5BDate+-+Publication%5D+%3A+%223000%22%5BDate+-+Publication%5D"
posturl = "https://www.ncbi.nlm.nih.gov/pubmed/"

s = requests.session()
s.headers["User-Agent"] = "Mozilla/5.0"

# Fetch the first results page and collect all named inputs of the EntrezForm form
soup = BeautifulSoup(s.get(geturl).text,"lxml")
inputs = {i['name']: i.get('value', '') for i in soup.select('form#EntrezForm input[name]')}

# The total result count is reported in a hidden form field
results = int(inputs['EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_ResultsController.ResultCount'])
items_per_page = 100
pages = results // items_per_page + int(bool(results % items_per_page))  # ceiling division

# Ask for 100 results per page and tell the server a page change occurred
inputs['EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.PageSize'] = items_per_page
inputs['EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_DisplayBar.PrevPageSize'] = items_per_page
inputs['EntrezSystem2.PEntrez.DbConnector.Cmd'] = 'PageChanged'

links = []

for page in range(pages):
    # Update the pager fields: CurrPage is the page to load, cPage the one we're on
    inputs['EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.CurrPage'] = page + 1
    inputs['EntrezSystem2.PEntrez.PubMed.Pubmed_ResultsPanel.Pubmed_Pager.cPage'] = page

    res = s.post(posturl, inputs)
    soup = BeautifulSoup(res.text, "lxml")

    items = [i['href'] for i in soup.select("div.rslt p.title a[href]")]
    links += items

    for i in items:
        print(i)

I'm requesting 100 items per page because larger numbers seem to 'break' the server, but you should be able to adjust that number with some error checking.
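
That error checking could look something like the minimal retry wrapper below; the function name and the retry parameters are my own choices, not part of the script above:

```python
import time
import requests

def post_with_retry(session, url, data, attempts=3, delay=2):
    """POST with basic error checking: retry a few times, then give up."""
    for attempt in range(attempts):
        try:
            res = session.post(url, data=data, timeout=30)
            res.raise_for_status()  # raise on 4xx/5xx responses
            return res
        except requests.RequestException:
            if attempt == attempts - 1:
                raise  # out of attempts; propagate the error
            time.sleep(delay)  # brief pause before retrying
```

On a persistently failing page you could also lower items_per_page and retry, since large page sizes are what seem to trip the server.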

Finally, the links are displayed in descending order (/29960282, /29960281, ...), so I thought we could calculate them without performing any POST requests:

geturl = "https://www.ncbi.nlm.nih.gov/pubmed/?term=%222015%22%5BDate+-+Publication%5D+%3A+%223000%22%5BDate+-+Publication%5D"
posturl = "https://www.ncbi.nlm.nih.gov/pubmed/"

s = requests.session()
s.headers["User-Agent"] = "Mozilla/5.0"
soup = BeautifulSoup(s.get(geturl).text,"lxml")

# The result count and the first (highest) article id give us the full range
results = int(soup.select_one('[name$=ResultCount]')['value'])
first_link = int(soup.select_one("div.rslt p.title a[href]")['href'].split('/')[-1])
last_link = first_link - results

# Links appear in descending order, so generate them arithmetically
links = [posturl + str(i) for i in range(first_link, last_link, -1)]

But unfortunately the results are not accurate.

answered Nov 14 '22 by t.m.adam