What is the difference in accessing Cloudflare website using ChromeDriver/Chrome in normal/headless mode through Selenium Python

I have a question about --headless mode in Python Selenium for Chrome.

Code

 from selenium import webdriver
 from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

 CHROME_DRIVER_DIR = "selenium/chromedriver"

 chrome_options = webdriver.ChromeOptions()
 caps = DesiredCapabilities().CHROME
 chrome_options.add_argument("--disable-dev-shm-usage")
 chrome_options.add_argument("--remote-debugging-port=9222")
 chrome_options.add_argument("--headless")  # Runs Chrome in headless mode.
 chrome_options.add_argument("--no-sandbox")  # Bypass OS security model
 chrome_options.add_argument("--disable-extensions")
 chrome_options.add_argument("--disable-gpu")

 browser = webdriver.Chrome(desired_capabilities=caps, executable_path=CHROME_DRIVER_DIR, options=chrome_options)

 browser.get("https://www.manta.com/c/mm2956g/mashuda-contractors")
 print(browser.page_source)
 browser.quit()

When I remove chrome_options.add_argument("--headless"), everything works fine, but with --headless I get the following response:

Please enable cookies.

Error 1020 Ray ID: 53fd62b4087d8116 • 2019-12-04 11:19:28 UTC

Access denied

What happened?
This website is using a security service to protect itself from online attacks.

Cloudflare Ray ID: 53fd62b4087d8116 • Your IP: 168.81.117.111 • Performance & security by Cloudflare

What is the difference between normal mode and --headless?

asked Dec 04 '19 11:12

Максим Дихтярь

1 Answer

When scraping a Cloudflare-protected website, here are the things you need to do:

Ensure you send headers identical to what a real browser sends, and in the same order
Ensure you use an IP address outside the datacenter ranges
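
As a rough illustration of the first point, here is a minimal sketch of arranging headers in a fixed, browser-like order before handing them to an HTTP client. The order list and the order_headers helper are assumptions made up for this example, not captured from a real browser; record the actual order from your own browser's network tools. Whether that order survives on the wire also depends on the client library you use.

```python
# Hypothetical header order for illustration only; capture the real one
# from your browser's developer tools.
BROWSER_HEADER_ORDER = [
    "Host",
    "Connection",
    "Upgrade-Insecure-Requests",
    "User-Agent",
    "Accept",
    "Accept-Encoding",
    "Accept-Language",
]


def order_headers(headers, order=BROWSER_HEADER_ORDER):
    """Return a new dict whose keys follow `order` (matched case-insensitively).

    Headers not named in `order` are appended afterwards in their original order.
    """
    lookup = {name.lower(): (name, value) for name, value in headers.items()}
    ordered = {}
    for name in order:
        hit = lookup.pop(name.lower(), None)
        if hit is not None:
            ordered[name] = hit[1]  # use the canonical spelling from `order`
    for name, value in lookup.values():
        ordered[name] = value
    return ordered


headers = order_headers({
    "accept": "*/*",
    "X-Extra": "1",
    "user-agent": "UA",
    "Host": "example.com",
})
print(list(headers))
```

Python dicts preserve insertion order, so the result can be passed to a client that writes headers in dict order.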
And if it still does not work, as in my case...

I encountered the same issue when scraping an e-commerce website (guess dot com). Changing the header order didn't fix it for me. My conclusion: apparently, Cloudflare analyzes the TLS fingerprint of the request and returns a 403 (error 1020) when the fingerprint matches Node.js, Python, or curl, which are commonly used for scraping. The solution is to emulate the fingerprint of a popular browser, and the most obvious way to do that is to use Puppeteer with the puppeteer-extra stealth plugin. And it worked!

But Puppeteer was not fast enough for my use case (to put it mildly: Puppeteer is very resource-hungry and sluggish), so I had to build a utility that uses BoringSSL (the SSL library used by Chrome). Since compiling C/C++ code and deciphering the cryptic compilation errors of a TLS library is no fun for most web developers, I wrapped it as an API server, which you can try here: https://rapidapi.com/restyler/api/scrapeninja

Read more on how Cloudflare analyzes TLS: https://blog.cloudflare.com/monsters-in-the-middleboxes/
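
To make the idea of a TLS fingerprint more concrete, here is a minimal sketch of a JA3-style fingerprint, one publicly documented way of summarizing a TLS ClientHello. Whether Cloudflare uses this exact scheme is not stated above, so treat it as an illustration only. The numeric values are placeholders, not real captures from Chrome or Python.

```python
import hashlib


def ja3_string(tls_version, ciphers, extensions, curves, point_formats):
    """Build a JA3-style string: commas between fields, dashes within a field."""
    fields = [[tls_version], ciphers, extensions, curves, point_formats]
    return ",".join("-".join(str(v) for v in field) for field in fields)


def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    """MD5 hex digest of the JA3-style string."""
    raw = ja3_string(tls_version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()


# Placeholder ClientHello parameters (assumptions, not measured values).
browser_like = (771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0])
script_like = (771, [4866, 4867, 4865], [0, 11, 10], [29, 23], [0, 1, 2])

print(ja3_string(*browser_like))
print(ja3_hash(*browser_like))
print(ja3_hash(*script_like))
```

Because the cipher list, the extensions, and even their order feed into the hash, two clients that send the same HTTP headers can still be told apart at the TLS layer.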

answered Oct 20 '22 21:10

superjet

Tags: python, selenium, selenium-chromedriver, cloudflare, google-chrome-headless
