Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the difference in accessing Cloudflare website using ChromeDriver/Chrome in normal/headless mode through Selenium Python

I have a question about --headless mode in Python Selenium for Chrome.

Code

 from selenium import webdriver
 from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

 CHROME_DRIVER_DIR = "selenium/chromedriver"

 chrome_options = webdriver.ChromeOptions()
 caps = DesiredCapabilities().CHROME
 chrome_options.add_argument("--disable-dev-shm-usage")
 chrome_options.add_argument("--remote-debugging-port=9222")
 chrome_options.add_argument("--headless")  # Runs Chrome in headless mode.
 chrome_options.add_argument('--no-sandbox')  # # Bypass OS security model
 chrome_options.add_argument("--disable-extensions")
 chrome_options.add_argument("--disable-gpu")

 browser = webdriver.Chrome(desired_capabilities=caps, executable_path=CHROME_DRIVER_DIR, options=chrome_options)

 browser.get("https://www.manta.com/c/mm2956g/mashuda-contractors")
 print(browser.page_source)
 browser.quit()

When I'm remove chrome_options.add_argument("--headless") all working good, but with this --headless* got next issue

Please enable cookies.

Error 1020 Ray ID: 53fd62b4087d8116 • 2019-12-04 11:19:28 UTC

Access denied

What happened?
This website is using a security service to protect itself from online attacks.

Cloudflare Ray ID: 53fd62b4087d8116 • Your IP: 168.81.117.111 • Performance & security by Cloudflare

What is the difference for normal mode and --headless?

like image 449
Максим Дихтярь Avatar asked Dec 04 '19 11:12

Максим Дихтярь


People also ask

What is difference between Chrome and Chrome headless?

It provides the ability to control Chrome via external programs. The headless mode can run on servers without the need for dedicated display or graphics. It gets the name headless because it runs without the Graphical User Interface (GUI). The GUI of a program is referred to as the head.

Can Selenium bypass CloudFlare?

Resetting the Selenium Driver is a clever way to bypass CloudFlare detection. After accessing the detection page of CloudFlare using Selenium, the Selenium Driver needs to be reset in order to bypass CloudFlare detection.

Is Chrome and ChromeDriver same?

ChromeDriver uses the same version number scheme as Chrome. See https://www.chromium.org/developers/version-numbers for more details. Each version of ChromeDriver supports Chrome with matching major, minor, and build version numbers. For example, ChromeDriver 73.0.


1 Answers

When scraping CloudFlare protected website, here is the list of things you need to do:

  1. Ensure you are sending headers identical (and in the same order) to what browser sends
  2. Ensure you are using non-datacenter ip address range
  3. And if it still does not work, like in my case...

I encountered the same issue when scraping one ecommerce website (guess dot com). Changing headers order didn't fix it for me. My conclusions: apparently, CloudFlare analyses the TLS fingerprint of the request and throws 403 (1020) code in case the fingerprint matches node.js/python/curl which are usually used for scraping. The solution is to emulate the fingeprint of some popular browser - and the most obvious way would be to use Puppeteer.js with puppeteer extra stealth plugin. And it worked! But.. since Puppeteer was not fast enough for my use case (I put it mildly.. Puppeteer is insane in terms of resources and sluggishness) I had to build an utility which uses boringSSL (the SSL lib used by Chrome) - and since compiling C/C++ code and figuring out the cryptic compilation errors of some TLS library is no fun for most of web devs - I wrapped it as an API server, which you can try here: https://rapidapi.com/restyler/api/scrapeninja

Read more on how CloudFlare analyzes TLS: https://blog.cloudflare.com/monsters-in-the-middleboxes/

like image 200
superjet Avatar answered Oct 20 '22 21:10

superjet