 

Workaround for blocked GET requests in Python

I'm trying to retrieve and process the results of a web search using requests and beautifulsoup.

I've written some simple code to do the job, and it returns successfully (status = 200), but the content of the response is just the error message "We're sorry for any inconvenience, but the site is currently unavailable.", and it has been the same for the last several days. Searching from within Firefox returns results without issue, however. I've run the code with a URL for the UK-based site and it works fine, so I wonder whether the US site is set up to block attempts to scrape its search results.

Are there ways to mask the fact that I'm retrieving search results from within Python (e.g., masquerading as a standard Firefox search), or some other workaround that would allow access to the search results?

Code included for reference below:

import pandas as pd
from requests import get
import bs4 as bs
import re

# works
# baseURL = 'https://www.autotrader.co.uk/car-search?sort=sponsored&radius=1500&postcode=ky119sb&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New&make=TOYOTA&model=VERSO&year-from=1990&year-to=2017&minimum-mileage=0&maximum-mileage=200000&body-type=MPV&fuel-type=Diesel&minimum-badge-engine-size=1.6&maximum-badge-engine-size=4.5&maximum-seats=8'
# doesn't work
baseURL = 'https://www.autotrader.com/cars-for-sale/Certified+Cars/cars+under+50000/Jeep/Grand+Cherokee/Seattle+WA-98101?extColorsSimple=BURGUNDY%2CRED%2CWHITE&maxMileage=45000&makeCodeList=JEEP&listingTypes=CERTIFIED%2CUSED&interiorColorsSimple=BEIGE%2CBROWN%2CBURGUNDY%2CTAN&searchRadius=0&modelCodeList=JEEPGRAND&trimCodeList=JEEPGRAND%7CSRT%2CJEEPGRAND%7CSRT8&zip=98101&maxPrice=50000&startYear=2015&marketExtension=true&sortBy=derivedpriceDESC&numRecords=25&firstRecord=0'

a = get(baseURL)
soup = bs.BeautifulSoup(a.content, 'html.parser')

# Each listing's details and its price sit in separate containers
info = soup.find_all('div', class_='information-container')
price = soup.find_all('div', class_='vehicle-price')

d = []
for idx, i in enumerate(info):
    # The <li> items under each listing hold year, mileage, engine, hp, ...
    ii = i.find_next('ul').find_all('li')

    year_ = ii[0].text
    miles = re.sub(r"[^0-9.]", "", ii[2].text)    # keep digits and '.' only
    engine = ii[3].text
    hp = re.sub(r"[^\d.]", "", ii[4].text)
    p = re.sub(r"[^\d.]", "", price[idx].text)

    d.append([year_, miles, engine, hp, p])

df = pd.DataFrame(d, columns=['year', 'miles', 'engine', 'hp', 'price'])
asked Jul 23 '19 by Chris

People also ask

Is Python's requests library blocking?

Like urllib2, requests is blocking. That said, I wouldn't suggest switching to another library. The simplest answer is to run each request in a separate thread; unless you have hundreds of them, this should be fine.
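
For completeness, here's a minimal sketch of the thread-per-request idea using concurrent.futures from the standard library; the URLs are placeholders:

import concurrent.futures
import requests

urls = ['https://example.com/a', 'https://example.com/b']  # placeholder URLs

def fetch(url):
    # Each call blocks only the worker thread that runs it
    return requests.get(url, timeout=10)

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
    responses = list(pool.map(fetch, urls))

for r in responses:
    print(r.url, r.status_code)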

Can websites block Python requests?

A request might be blocked because, for example, the Python requests library sends the default user agent python-requests. A website can recognise that this is a bot and may block the request to protect itself from overload when many requests are being sent. Check what your user agent is.
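
To check it, a quick sketch: requests exposes its default headers, and a header-echoing service such as httpbin.org shows what a server actually receives:

import requests

# Inspect the default headers locally, without sending anything
print(requests.utils.default_headers()['User-Agent'])   # e.g. python-requests/2.22.0

# Or ask an echo service what it actually received
r = requests.get('https://httpbin.org/headers')
print(r.json()['headers']['User-Agent'])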

How does a GET request work with Python requests?

HTTP works as a request-response protocol between a client and a server. The GET method retrieves information from a given server using a given URI; the encoded user information is appended to the page request as a query string.
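
As an illustration, requests builds that query string for you when you pass a dict via params (a small sketch against httpbin.org; the parameter names are arbitrary):

import requests

# requests URL-encodes the dict and appends it to the URL as a query string
r = requests.get('https://httpbin.org/get', params={'make': 'Jeep', 'zip': '98101'})
print(r.url)             # https://httpbin.org/get?make=Jeep&zip=98101
print(r.status_code)     # 200
print(r.json()['args'])  # {'make': 'Jeep', 'zip': '98101'}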


1 Answer

By default, Requests sends its own, easily identifiable user agent with every request.

>>> import requests
>>> r = requests.get('https://google.com')
>>> r.request.headers
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

It is possible that the website you are using is trying to avoid scrapers by denying any request with a user agent of python-requests.

To get around this, you can change your user agent when sending the request. Since the search works in your browser, simply copy your browser's user agent (you can Google it, or inspect one of your browser's requests in its developer tools and copy the user agent from there). For me it's Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36 (what a mouthful), so I'd set my user agent like this:

>>> headers = {
...     'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
... }

and then send the request with the new headers (the new headers are merged with the default headers; they only replace a default header when the names match):

>>> r = requests.get('https://google.com', headers=headers)  # Using the custom headers we defined above
>>> r.request.headers
{'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

Now we can see that the request was sent with our preferred headers, and hopefully the site won't be able to tell the difference between Requests and a browser.
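
Applied to the code in the question, a minimal sketch could look like this (baseURL stands for the autotrader.com search URL from the question, and the CSS class names assume the page markup hasn't changed):

import requests
import bs4 as bs

headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36')
}

# baseURL: the autotrader.com search URL from the question
a = requests.get(baseURL, headers=headers)
soup = bs.BeautifulSoup(a.content, 'html.parser')

info = soup.find_all('div', class_='information-container')
price = soup.find_all('div', class_='vehicle-price')
print(a.status_code, len(info), len(price))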

answered Oct 19 '22 by Michael Kolber