I have a question about --headless
mode in Python Selenium for Chrome.
Code
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
CHROME_DRIVER_DIR = "selenium/chromedriver"
chrome_options = webdriver.ChromeOptions()
caps = DesiredCapabilities().CHROME
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--remote-debugging-port=9222")
chrome_options.add_argument("--headless") # Runs Chrome in headless mode.
chrome_options.add_argument('--no-sandbox') # # Bypass OS security model
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--disable-gpu")
browser = webdriver.Chrome(desired_capabilities=caps, executable_path=CHROME_DRIVER_DIR, options=chrome_options)
browser.get("https://www.manta.com/c/mm2956g/mashuda-contractors")
print(browser.page_source)
browser.quit()
When I'm remove chrome_options.add_argument("--headless")
all working good, but with this --headless*
got next issue
Please enable cookies.
Error 1020 Ray ID: 53fd62b4087d8116 • 2019-12-04 11:19:28 UTC
Access denied
What happened?
This website is using a security service to protect itself from online attacks.
Cloudflare Ray ID: 53fd62b4087d8116 • Your IP: 168.81.117.111 • Performance & security by Cloudflare
What is the difference for normal mode and --headless
?
It provides the ability to control Chrome via external programs. The headless mode can run on servers without the need for dedicated display or graphics. It gets the name headless because it runs without the Graphical User Interface (GUI). The GUI of a program is referred to as the head.
Resetting the Selenium Driver is a clever way to bypass CloudFlare detection. After accessing the detection page of CloudFlare using Selenium, the Selenium Driver needs to be reset in order to bypass CloudFlare detection.
ChromeDriver uses the same version number scheme as Chrome. See https://www.chromium.org/developers/version-numbers for more details. Each version of ChromeDriver supports Chrome with matching major, minor, and build version numbers. For example, ChromeDriver 73.0.
When scraping CloudFlare protected website, here is the list of things you need to do:
I encountered the same issue when scraping one ecommerce website (guess dot com). Changing headers order didn't fix it for me. My conclusions: apparently, CloudFlare analyses the TLS fingerprint of the request and throws 403 (1020) code in case the fingerprint matches node.js/python/curl which are usually used for scraping. The solution is to emulate the fingeprint of some popular browser - and the most obvious way would be to use Puppeteer.js with puppeteer extra stealth plugin. And it worked! But.. since Puppeteer was not fast enough for my use case (I put it mildly.. Puppeteer is insane in terms of resources and sluggishness) I had to build an utility which uses boringSSL (the SSL lib used by Chrome) - and since compiling C/C++ code and figuring out the cryptic compilation errors of some TLS library is no fun for most of web devs - I wrapped it as an API server, which you can try here: https://rapidapi.com/restyler/api/scrapeninja
Read more on how CloudFlare analyzes TLS: https://blog.cloudflare.com/monsters-in-the-middleboxes/
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With