I want to scrape the following website: https://www.coches.net/segunda-mano/.
But every time I open it with python selenium, I get the message that they detected me as a bot.
How can I bypass this detection?
First I tried simple code with selenium:
from selenium import webdriver
from bs4 import BeautifulSoup
browser = webdriver.Chrome('C:/Python38/chromedriver.exe')
URL = 'https://www.coches.net/segunda-mano/'
browser.get(URL)
Then I tried it with request, but it doesn't work, too.
from selenium import webdriver
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
import requests
ua = UserAgent()
headers = {"UserAgent":ua.random}
URL = 'https://www.coches.net/segunda-mano/'
r = requests.get(URL, headers = headers)
print(r.statuscode)
In this case I get the message 403 = Status code stating that access to the URL is prohibited.
Don't know how to get entry to this webpage without getting blocked. I would be very grateful for your help.
Selenium is fairly easily detected, especially by all major anti-bot providers (Cloudflare, Akamai, etc).
Why?
Selenium, and most other major webdrivers set a browser variable (that websites can access) called navigator.webdriver to true. You can check this yourself by heading to your Google Chrome console and running console.log(navigator.webdriver). If you're on a normal browser, it will be false.
The User-Agent, typically all devices have what is called a "user agent", this refers to the device accessing the website. Selenium's User-Agent looks something like this: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/59.0.3071.115 Safari/537.36. Did you catch that? HeadlessChrome is included, this is another route of detection.
These are just two of the multiple ways a Selenium browser can be detected, I would highly recommend reading up on this and this as well.
And lastly, if you want an easy, drop-in solution to bypass detection that implements almost all of these concepts we've talked about, I'd suggest using undetected-chromedriver. This is an open source project that tries it's best to keep your Selenium chromedriver looking human.
I think your problem is not bot detection. You can't use just requests to get the results from that page, because it makes XHR requests behind the scene. So you must use Selenium, splash, etc, but seems is not possible for this case.
However, if you research a bit in the page you can find which url is requested behind the scenes to display the resutls. I did that research and found this page(https://ms-mt--api-web.spain.advgo.net/search), it returns a json, so it will ease your work in terms of parsing. Using chrome dev tools I got the curl request and just map it to python requests and obtain this code:
import json
import requests
headers = {
'authority': 'ms-mt--api-web.spain.advgo.net',
'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
'accept': 'application/json, text/plain, */*',
'x-adevinta-channel': 'web-desktop',
'x-schibsted-tenant': 'coches',
'sec-ch-ua-mobile': '?0',
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
'content-type': 'application/json;charset=UTF-8',
'origin': 'https://www.coches.net',
'sec-fetch-site': 'cross-site',
'sec-fetch-mode': 'cors',
'sec-fetch-dest': 'empty',
'referer': 'https://www.coches.net/',
'accept-language': 'en-US,en;q=0.9,es;q=0.8',
}
data = '{"pagination":{"page":1,"size":30},"sort":{"order":"desc","term":"relevance"},"filters":{"categories":{"category1Ids":[2500]},"offerTypeIds":[0,2,3,4,5],"isFinanced":false,"price":{"from":null,"to":null},"year":{"from":null,"to":null},"km":{"from":null,"to":null},"provinceIds":[],"fuelTypeIds":[],"bodyTypeIds":[],"doors":[],"seats":[],"transmissionTypeId":0,"hp":{"from":null,"to":null},"sellerTypeId":0,"hasWarranty":null,"isCertified":false,"luggageCapacity":{"from":null,"to":null},"contractId":0}}'
while True:
response = requests.post('https://ms-mt--api-web.spain.advgo.net/search', headers=headers, data=data).json()
# you should parse items here.
print(response)
if not response["items"]:
break
data_dict = json.loads(data)
data_dict["pagination"]["page"] = data_dict["pagination"]["page"]+1 # get the next page.
data = json.dumps(data_dict)
Probably there are a lot of headers and body info that are unnecesary, you can code-and-test to improve it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With