
Cloudflare and chromedriver: how does Cloudflare distinguish between chromedriver and genuine Chrome?

I would like to use chromedriver to scrape some stories from fanfiction.net. I tried the following:

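from selenium import webdriver
import time

# Raw string so the backslashes in the Windows path are not read as escapes
path = r'D:\chromedriver\chromedriver.exe'

# Selenium 3 style call: the chromedriver path is passed as the first argument
browser = webdriver.Chrome(path)
url1 = 'https://www.fanfiction.net/s/8832472'
url2 = 'https://www.fanfiction.net/s/5218118'

browser.get(url1)  # first story: loads, sometimes after a wait
time.sleep(5)
browser.get(url2)  # second story: Cloudflare shows captchas here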

The first link opens (sometimes I have to wait 5 seconds). When I try to load the second URL, Cloudflare intervenes and asks me to solve captchas, which cannot be solved, or at least Cloudflare does not recognize them as solved. The same thing happens if I enter the links manually in the chromedriver-controlled window (that is, through the GUI). However, if I do the same in normal Chrome, everything works fine (I do not even get the waiting period on the first link), even in private mode and with all cookies deleted. I could reproduce this on several machines.

Now my question: my intuition was that chromedriver is just the normal Chrome browser made controllable. What is the difference from normal Chrome, how does Cloudflare distinguish the two, and how can I mask my chromedriver as normal Chrome? (I do not intend to load many pages in a very short time, so it should not look like a bot.) I hope my question is clear.

Tamar asked Oct 26 '22 15:10


1 Answer

This error message...

Checking your browser before accessing

...implies that Cloudflare has detected your requests to the website as coming from an automated bot and is therefore denying you access to the application.
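
One difference between a driven and a normal Chrome can be checked directly: the W3C WebDriver specification requires a browser under automation to report `navigator.webdriver` as true. Whether Cloudflare relies on this particular flag is an assumption, since its checks are not public. The sketch below only shows how to inspect such signals yourself; the helper names `automation_signals`, `looks_automated` and `demo` are hypothetical.

```python
def automation_signals(driver):
    """Read two browser properties that a page script could also read."""
    return {
        "navigator.webdriver": driver.execute_script("return navigator.webdriver"),
        "userAgent": driver.execute_script("return navigator.userAgent"),
    }


def looks_automated(signals):
    """Heuristic only: flag the session if either well-known marker is present."""
    user_agent = signals.get("userAgent") or ""
    return signals.get("navigator.webdriver") is True or "HeadlessChrome" in user_agent


def demo():
    """Open a chromedriver session and print what it reveals.

    Assumes Selenium is installed and can locate a chromedriver binary.
    """
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get("about:blank")
        signals = automation_signals(driver)
        print(signals, looks_automated(signals))
    finally:
        driver.quit()
```

Calling `demo()` with plain chromedriver and then comparing against the same JavaScript expressions typed into the console of a normal Chrome window shows where the two sessions differ.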


Solution

In these cases, a potential solution is to use undetected-chromedriver to initialize the Chrome browsing context.

undetected-chromedriver is an optimized Selenium Chromedriver patch which does not trigger anti-bot services like Distill Network / Imperva / DataDome / Botprotect.io. It automatically downloads the driver binary and patches it.

  • Code Block:

    import undetected_chromedriver as uc
    import time

    # Use the options class shipped with undetected-chromedriver
    options = uc.ChromeOptions()
    options.add_argument("start-maximized")
    # uc.Chrome downloads a driver binary and patches it before starting Chrome
    driver = uc.Chrome(options=options)
    url1 = 'https://www.fanfiction.net/s/8832472'
    url2 = 'https://www.fanfiction.net/s/5218118'
    driver.get(url1)
    time.sleep(5)  # pause between the two page loads
    driver.get(url2)
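
Instead of a fixed `time.sleep(5)`, the script could poll until the interstitial is gone. This is a minimal sketch, assuming the challenge page still contains the message quoted at the top of this answer; the current wording on the live site may differ. The names `CHALLENGE_MARKER`, `is_challenge_page` and `get_when_cleared` are hypothetical.

```python
import time

# Assumption: the interstitial contains the text quoted in this answer.
CHALLENGE_MARKER = "Checking your browser before accessing"


def is_challenge_page(page_source):
    """Return True if the HTML looks like the Cloudflare interstitial."""
    return CHALLENGE_MARKER in page_source


def get_when_cleared(driver, url, timeout=30.0, poll=1.0):
    """Load url, then poll until the challenge text disappears.

    Returns True once the real page is shown, False if the challenge
    is still displayed when the timeout expires.
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        if not is_challenge_page(driver.page_source):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

With the `driver` created above, `get_when_cleared(driver, url1)` would replace the `driver.get(url1)` and `time.sleep(5)` pair. A False result suggests the session is still being treated as a bot.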
    

References

You can find a couple of relevant detailed discussions in:

  • Selenium app redirect to Cloudflare page when hosted on Heroku
  • How to bypass being rate limited ..HTML Error 1015 using Python
undetected Selenium answered Nov 17 '22 07:11