Python requests not redirecting

Tags:

python

python-requests

I'm trying to scrape word definitions, but can't get Python to redirect to the correct page. For example, I'm trying to get the definition for the word 'agenesia'. When you load https://www.lexico.com/definition/agenesia in a browser, the page that loads is https://www.lexico.com/definition/agenesis. In Python, however, the request doesn't redirect and returns a 200 status code:

URL = 'https://www.lexico.com/definition/agenesia'
page = requests.head(URL, allow_redirects=True)

This is how I'm currently retrieving the page content. I've also tried using requests.get, but that doesn't work either.

EDIT: Because it wasn't clear: I'm aware that I could change the word to 'agenesis' in the URL to get the correct page, but I am scraping a list of words and would prefer to follow the redirect automatically rather than looking it up by hand in a browser first.

EDIT 2: I realised it might be easier to check solutions against the rest of my code. So far this works with agenesis but not agenesia:

soup = BeautifulSoup(page.content, 'html.parser')

print(soup.find("span", {"class": "ind"}).get_text(), '\n')
print(soup.find("span", {"class": "pos"}).get_text())

asked May 22 '21 13:05

7koFnMiP

3 Answers

The other answers don't make your request redirect. The cause is that you aren't sending the request header the site expects. Try the code below, which sends a browser-like accept header:

import requests
from bs4 import BeautifulSoup

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
}

page = requests.get('https://www.lexico.com/definition/agenesia', headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

print(page.url)
print(soup.find("span", {"class": "ind"}).get_text(), '\n')
print(soup.find("span", {"class": "pos"}).get_text())

This prints:

https://www.lexico.com/definition/agenesis?s=t
Failure of development, or incomplete development, of a part of the body. 

noun
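
Since the question mentions scraping a whole list of words, here is a rough sketch of how the same idea might be wrapped in a helper. The names resolve_word, resolve_words, BASE_URL and HEADERS are made up for illustration, the shortened accept header is an assumption rather than something the site is known to require in that exact form, and session is expected to be a requests.Session (or anything with a compatible get method). It relies on response.history, which requests fills with the intermediate redirect responses.

```python
from urllib.parse import quote

# Hypothetical constants for this sketch; the accept value is a shortened
# version of the header used in the answer above (an assumption).
BASE_URL = "https://www.lexico.com/definition/"
HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}


def resolve_word(session, word):
    """Fetch one word and return (final_url, was_redirected)."""
    response = session.get(
        BASE_URL + quote(word),
        headers=HEADERS,
        allow_redirects=True,
        timeout=10,
    )
    response.raise_for_status()
    # response.history lists the intermediate redirect responses, oldest
    # first; it is empty when the original URL was answered directly.
    return response.url, bool(response.history)


def resolve_words(session, words):
    """Map each word to the (final_url, was_redirected) pair it resolved to."""
    return {word: resolve_word(session, word) for word in words}
```

With the requests package installed, this could be called as resolve_words(requests.Session(), ['agenesia', 'agenesis']). Passing the session in, rather than creating it inside the function, keeps one connection pool for the whole word list.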

answered Oct 29 '22 01:10

jizhihaoSAMA

You are making a HEAD request:

The HTTP HEAD method requests the headers that would be returned if the HEAD request's URL was instead requested with the HTTP GET method.

You want a GET request instead:

URL = 'https://www.lexico.com/definition/agenesia'
page = requests.get(URL, allow_redirects=True)
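
A HEAD response carries no body, so page.content is empty and soup.find(...) returns None, which makes the .get_text() calls in the question raise AttributeError. A minimal sketch of a more defensive parse step follows; extract_definition is a hypothetical helper name, and the "ind" and "pos" class names are taken from the question rather than checked against the site.

```python
from bs4 import BeautifulSoup


def extract_definition(html):
    """Return (definition, part_of_speech), or None if either span is missing."""
    soup = BeautifulSoup(html, "html.parser")
    definition = soup.find("span", {"class": "ind"})
    part_of_speech = soup.find("span", {"class": "pos"})
    # find() returns None when nothing matches, e.g. for the empty body of a
    # HEAD response or for a page that is not a definition page.
    if definition is None or part_of_speech is None:
        return None
    return definition.get_text(strip=True), part_of_speech.get_text(strip=True)
```

Calling extract_definition(page.content) and checking for None makes it easy to log the words that did not resolve instead of crashing partway through the list.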

answered Oct 29 '22 01:10

Ôrel

If you don't mind a browser window popping up, Selenium is really good for scraping at a more user-friendly level. If you know the CSS selector of the page element, you can scrape it with driver.find_element(By.CSS_SELECTOR, 'theselector').text, where driver = webdriver.Chrome() and By is imported from selenium.webdriver.common.by. (Older Selenium releases spelled this driver.find_element_by_css_selector('theselector').text and took the chromedriver file path as an argument to webdriver.Chrome.) This is a pretty radical circumvention of the problem, so I understand if it's not applicable to your specific situation, but hopefully you find this answer useful. :)

answered Oct 29 '22 01:10

Xpired

