I'm trying to scrape word definitions, but I can't get Python to follow the redirect to the correct page. For example, I'm trying to get the definition of the word 'agenesia'. When you load https://www.lexico.com/definition/agenesia in a browser, the page that loads is https://www.lexico.com/definition/agenesis. In Python, however, the page doesn't redirect and the request returns a 200 status code:
URL = 'https://www.lexico.com/definition/agenesia'
page = requests.head(URL, allow_redirects=True)
This is how I'm currently retrieving the page content. I've also tried using requests.get, but that doesn't work either.
EDIT: Because it isn't clear, I'm aware that I could change the word to 'agenesis' in the URL to get the correct page, but I am scraping a list of words and would rather automatically follow the URL rather than searching in a browser for the redirect by hand first.
EDIT 2: I realised it might be easier to check solutions against the rest of my code. So far this works with agenesis but not agenesia:
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.find("span", {"class": "ind"}).get_text(), '\n')
print(soup.find("span", {"class": "pos"}).get_text())
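Use Python's urllib library to get the redirection URL: import the urllib.request module, define a web page URL that you expect to be redirected, send the request, and get the response object. Then check the status code returned by the web server; a 301 means the URL has been redirected permanently. Note that urllib.request.urlopen follows redirects by default, so the status code you see is normally that of the final page unless redirect handling is disabled.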
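The urllib steps described above can be sketched as follows. This is a minimal illustration, not a tested fix for the site in the question: the names NoRedirect and first_hop are hypothetical, and the sketch assumes the server signals the redirect with a 3xx status and a Location header. It turns off urllib's automatic redirect handling so the first status code stays visible.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Hypothetical helper: stop urllib from following redirects."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None declines the redirect, so urllib raises
        # HTTPError for the 3xx response instead of following it.
        return None


def first_hop(url, timeout=10):
    """Return (status_code, location) for the first response only.

    location is the raw Location header for a redirect, else None.
    """
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status, None
    except urllib.error.HTTPError as err:
        if err.code in (301, 302, 303, 307, 308):
            return err.code, err.headers.get("Location")
        raise  # a real error (404, 500, ...), not a redirect
```

Calling first_hop('https://www.lexico.com/definition/agenesia') would return the redirect status and target only if the server answers that particular request with a redirect; a result of (200, None) means no redirect was issued for it.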
To follow redirect with Curl, use the -L or --location command-line option. This flag tells Curl to resend the request to the new address. When you send a POST request, and the server responds with one of the codes 301, 302, or 303, Curl will make the subsequent request using the GET method.
Flask: Redirect & Errors. Flask provides a redirect() function. When called, it returns a response object that redirects the user to another target location with the specified status code. The location parameter is the URL the response should redirect to. The status code is sent in the response header and defaults to 302.
The other answers don't make your request redirect. The cause is that you didn't send the correct request headers. Try the code below:
import requests
from bs4 import BeautifulSoup
headers = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
}
page = requests.get('https://www.lexico.com/definition/agenesia', headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
print(page.url)
print(soup.find("span", {"class": "ind"}).get_text(), '\n')
print(soup.find("span", {"class": "pos"}).get_text())
This prints:
https://www.lexico.com/definition/agenesis?s=t
Failure of development, or incomplete development, of a part of the body.
noun
You are doing a HEAD request:
The HTTP HEAD method requests the headers that would be returned if the HEAD request's URL was instead requested with the HTTP GET method.
You want to do a GET request instead:
URL = 'https://www.lexico.com/definition/agenesia'
page = requests.get(URL, allow_redirects=True)
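Whichever request method is used, it can help to check whether any redirect was actually followed. The sketch below is one way to do that for a list of words, using the response.url and response.history attributes of requests. The helper name resolve_word and the shortened Accept header are assumptions for illustration; the base URL is the one from the question.

```python
import requests

BASE_URL = "https://www.lexico.com/definition/"  # from the question

# Assumption: a shortened browser-like Accept header, in the spirit of
# the header-based answer in this thread.
HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}


def resolve_word(word, session, base_url=BASE_URL):
    """Fetch the page for `word`.

    Returns (final_url, hops), where hops lists the status codes of the
    redirect responses that were followed (empty if there were none).
    """
    response = session.get(
        base_url + word,
        headers=HEADERS,
        allow_redirects=True,
        timeout=10,
    )
    response.raise_for_status()
    hops = [r.status_code for r in response.history]
    return response.url, hops
```

With session = requests.Session(), calling resolve_word('agenesia', session) for each word in the list shows at a glance which words were redirected: an empty hops list means the server returned the page directly, which matches the 200 status described in the question.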
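If you don't mind a browser window popping up, Selenium is really good for scraping at a more user-friendly level. If you know the CSS selector of the page element, you can scrape it with driver.find_element_by_css_selector('theselector').text (in Selenium 4 and later, driver.find_element(By.CSS_SELECTOR, 'theselector').text), where driver = webdriver.Chrome('file path') and 'file path' points to the chromedriver executable.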
This is a pretty radical circumvention of the problem so I understand if it's not applicable to your specific situation but hopefully you find this answer useful. :)