
Python requests not redirecting

I'm trying to scrape word definitions, but I can't get Python to follow the redirect to the correct page. For example, I want the definition of the word 'agenesia'. When you load https://www.lexico.com/definition/agenesia in a browser, the page that loads is https://www.lexico.com/definition/agenesis. In Python, however, the request isn't redirected and returns a 200 status code.

import requests

URL = 'https://www.lexico.com/definition/agenesia'
page = requests.head(URL, allow_redirects=True)

This is how I'm currently retrieving the page content. I've also tried requests.get, but that doesn't work either.

EDIT: Because it isn't clear: I'm aware that I could change the word to 'agenesis' in the URL to get the correct page, but I'm scraping a list of words and would rather follow the redirect automatically than look each one up in a browser by hand first.

EDIT 2: I realised it might be easier to check solutions against the rest of my code. So far this works with agenesis but not agenesia:

soup = BeautifulSoup(page.content, 'html.parser')

print(soup.find("span", {"class": "ind"}).get_text(), '\n')
print(soup.find("span", {"class": "pos"}).get_text())
7koFnMiP asked May 22 '21 13:05


3 Answers

The other answers don't make your request redirect. The cause is that your request doesn't send the right headers: the redirect happens once a browser-like accept header is sent. Try the code below:

import requests
from bs4 import BeautifulSoup

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
}

page = requests.get('https://www.lexico.com/definition/agenesia', headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

print(page.url)
print(soup.find("span", {"class": "ind"}).get_text(), '\n')
print(soup.find("span", {"class": "pos"}).get_text())

This prints:

https://www.lexico.com/definition/agenesis?s=t
Failure of development, or incomplete development, of a part of the body. 

noun
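To check whether a request was actually redirected, you can look at the response's history and url attributes, which requests fills in when it follows redirects. Below is a minimal sketch, not a definitive implementation: the helper names (redirect_chain, was_redirected, fetch) are made up for illustration, the accept header is a shortened version of the one above, and the stand-in objects with example.com URLs and a 301 status are invented so the helpers can be tried without a network connection.

```python
from types import SimpleNamespace

# Browser-like Accept header, shortened from the one used above.
BROWSER_HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}


def redirect_chain(response):
    """Return [(status_code, url), ...] for every hop, final response last."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


def was_redirected(response):
    """True if at least one redirect was followed to reach this response."""
    return len(response.history) > 0


def fetch(url):
    """Hypothetical helper: GET `url` with the browser-like headers."""
    import requests  # imported lazily so the helpers above work without it

    return requests.get(url, headers=BROWSER_HEADERS, timeout=10)


# Stand-in objects with the same attributes as requests.Response.
# The URLs and the 301 status are invented for illustration only.
first_hop = SimpleNamespace(status_code=301, url="https://example.com/old", history=[])
final = SimpleNamespace(status_code=200, url="https://example.com/new", history=[first_hop])

print(redirect_chain(final))
print(was_redirected(final))
```

With a real response you would call redirect_chain(fetch(URL)) for each word in your list. An empty history together with a 200 status means the server answered the original URL directly.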
jizhihaoSAMA answered Oct 29 '22 01:10


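You are making a HEAD request.

The HTTP HEAD method requests the headers that would be returned if the HEAD request's URL was instead requested with the HTTP GET method.

A HEAD response has no body, so page.content is empty and BeautifulSoup has nothing to parse. Use a GET request instead: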

URL = 'https://www.lexico.com/definition/agenesia'
page = requests.get(URL, allow_redirects=True)
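To see the difference between the two methods without touching the real site, here is a small sketch that uses only the standard library. It swaps requests for urllib and talks to a throwaway local server, so the handler class, the fetch_body helper and the body text are all assumptions made for the demonstration.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Made-up page body for the demonstration.
BODY = b"<span class='ind'>example definition</span>"


class DemoHandler(BaseHTTPRequestHandler):
    def _send_headers(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()

    def do_HEAD(self):
        # HEAD: same headers as GET, but no body is written.
        self._send_headers()

    def do_GET(self):
        self._send_headers()
        self.wfile.write(BODY)

    def log_message(self, *args):
        # Silence the default per-request logging.
        pass


# Opener with an empty proxy map so the local request ignores system proxies.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch_body(url, method):
    """Send a request with the given HTTP method and return the body bytes."""
    request = urllib.request.Request(url, method=method)
    with opener.open(request, timeout=5) as response:
        return response.read()


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

head_body = fetch_body(url, "HEAD")
get_body = fetch_body(url, "GET")

server.shutdown()
server.server_close()

print("HEAD body length:", len(head_body))
print("GET body length:", len(get_body))
```

The HEAD request comes back with an empty body even though its Content-Length header matches the GET response, which is why parsing the result of requests.head finds no span elements.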
Ôrel answered Oct 29 '22 01:10


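If you don't mind a browser window popping up, Selenium is really good for scraping at a more user-friendly level. If you know the CSS selector of the page element, you can scrape it with driver.find_element(By.CSS_SELECTOR, 'theselector').text, where driver = webdriver.Chrome() and By is imported from selenium.webdriver.common.by. Recent Selenium releases locate chromedriver themselves, while older ones need its file path passed in; the older driver.find_element_by_css_selector('theselector') call was removed in Selenium 4. This is a pretty radical way around the problem, so I understand if it's not applicable to your specific situation, but hopefully you find this answer useful. :)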

Xpired answered Oct 29 '22 01:10