Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Webpage Is Detecting Selenium Webdriver with Chromedriver as a bot

I am trying to scrape https://www.controller.com/ with python, and since the page detected a bot using pandas.get_html, and requests using user-agents and a rotating proxy, i resorted to using selenium webdriver. However, this is also being detected as a bot with the following message. Can anybody explain how can I get past this?:

Pardon Our Interruption... As you were browsing www.controller.com something about your browser made us think you were a bot. There are a few reasons this might happen: You're a power user moving through this website with super-human speed. You've disabled JavaScript in your web browser. A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. Additional information is available in this support article. To request an unblock, please fill out the form below and we will review it as soon as possible"

Here is my code:

from selenium import webdriver
import requests
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
options = webdriver.ChromeOptions()
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
#options.add_argument('headless')
driver = webdriver.Chrome(chrome_options=options)
driver.get('https://www.controller.com/')
driver.implicitly_wait(30)
like image 951
Alex Avatar asked Mar 04 '19 13:03

Alex


3 Answers

You have mentioned about pandas.get_html only in your question and options.add_argument('headless') only in your code so not sure if you are implementing them. However taking out minimum code from your code attempt as follows:

  • Code Block:

    from selenium import webdriver
    
    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")
    options.add_argument("disable-infobars")
    options.add_argument("--disable-extensions")
    driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
    driver.get('https://www.controller.com/')
    print(driver.title)
    

I have faced the same issue.

  • Browser Snashot:

controller_com

When I inspected the HTML DOM it was observed that the website refers the distil_referrer on window.onbeforeunload as follows:

<script type="text/javascript" id="">
    window.onbeforeunload=function(a){"undefined"!==typeof sessionStorage&&sessionStorage.removeItem("distil_referrer")};
</script>

Snapshot:

onbeforeunload

This is a clear indication that the website is protected by Bot Management service provider Distil Networks and the navigation by ChromeDriver gets detected and subsequently blocked.


Distil

As per the article There Really Is Something About Distil.it...:

Distil protects sites against automatic content scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts.

Further,

"One pattern with Selenium was automating the theft of Web content", Distil CEO Rami Essaid said in an interview last week. "Even though they can create new bots, we figured out a way to identify Selenium the a tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".


Reference

You can find a couple of detailed discussion in:

  • Distil detects WebDriver driven Chrome Browsing Context
  • Selenium webdriver: Modifying navigator.webdriver flag to prevent selenium detection
  • Akamai Bot Manager detects WebDriver driven Chrome Browsing Context
like image 173
undetected Selenium Avatar answered Oct 10 '22 04:10

undetected Selenium


Finally solved the problem and headless mode works as well.

chrome_options = Options()
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--headless")

driver = webdriver.Chrome("chromedriver.exe", options=chrome_options)

driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36'})
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
like image 28
Raphael Avatar answered Oct 10 '22 05:10

Raphael


Distil can detect if you are headless by doing some fingerprinting using html5 canvas. They also check things like browser plugins and user-agent. Selenium sets some browser flags that are also detectable.

like image 36
pguardiario Avatar answered Oct 10 '22 05:10

pguardiario