
Browser Automation with Selenium: Fingerprints, recognizability and traceability?

I want to use Selenium/WebDriver to simulate a browser and scrape some website content with it. Even if it's not the fastest method, it has many advantages for me, such as executing scripts.

Many websites forbid automated access, for example search engines like Google or Bing.

For one tool I need to scrape the estimated result count (resultstat) from Google for several keywords. It would work like this: the simulated browser visits google.com, types in a keyword and scrapes the results, then after a short pause types in the next keyword, scrapes the results, and so on.

My question is: can a website recognize that I'm using Selenium to simulate the browser instead of operating the browser by hand? The Google case in particular gives me some doubts. I know Selenium is partly developed by Google, or at least by some people working for Google. So does Selenium leave fingerprints, or is it impossible to tell whether I'm using the browser myself or driving it with Selenium, even for Google?

zwieback86 asked Oct 04 '22 13:10


1 Answer

No, with WebDriver nobody can actually see that you're using Selenium rather than operating the browser by hand. I'm not sure about the old Selenium RC, but it should be the same. Here's how it works:

  1. Selenium opens up a browser with a clean profile (or with a profile you selected).
  2. Selenium is hooked up to the browser so it can steer and control it. The browser still does most of the work; Selenium basically replaces the user's input to the browser, nothing more.

You can easily verify this by reading the contents of the HTTP headers sent by your browser.
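One way to do that comparison locally is a tiny server that echoes back whatever headers it receives. The sketch below uses only the Python standard library; the names `HeaderEchoHandler`, `start_echo_server` and `fetch_headers` are made up for this illustration. Open the printed URL once in a Selenium-driven browser and once by hand, then compare the two outputs.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HeaderEchoHandler(BaseHTTPRequestHandler):
    """Replies to every GET with the request headers as JSON (lower-cased names)."""

    def do_GET(self):
        headers = {name.lower(): value for name, value in self.headers.items()}
        body = json.dumps(headers, indent=2).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def start_echo_server(port=0):
    """Start the server on a background thread; port=0 picks any free port."""
    server = HTTPServer(("127.0.0.1", port), HeaderEchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_headers(url, extra_headers=None):
    """Fetch the echoed headers without a browser, e.g. as a baseline."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(url, headers=extra_headers or {})
    with opener.open(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    server = start_echo_server()
    url = "http://127.0.0.1:%d/" % server.server_address[1]
    print("Open this URL in both browsers:", url)
    print(json.dumps(fetch_headers(url), indent=2))
    server.shutdown()
```

This only shows what is visible at the HTTP level; it says nothing about what a page's own scripts can observe inside the browser.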

If you ever actually needed your server to recognize Selenium, you could use BrowserMob Proxy and add a custom header to your requests.


All that said, there is one thing you must be aware of. While there's no way to detect Selenium directly, the website you're visiting can pick up some indirect clues. These usually involve watching for too many requests made in virtually no time, which might be an issue for you. Make sure your Selenium script behaves like a user.
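As a rough sketch of "behaving like a user" in terms of timing, the helper below inserts a randomized pause between successive keywords. The function names, the 2 to 8 second range and the `action` callback (which would wrap your own Selenium calls) are all assumptions for illustration, and irregular pauses are no guarantee against rate-based detection.

```python
import random
import time


def human_delays(count, low=2.0, high=8.0, seed=None):
    """Return `count` random pause lengths in seconds, each within [low, high]."""
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(count)]


def run_with_pauses(keywords, action, low=2.0, high=8.0, sleep=time.sleep, seed=None):
    """Call action(keyword) for each keyword, pausing between consecutive calls.

    `action` is a placeholder for whatever drives the browser, for example a
    function that types the keyword into the search box and reads the result.
    `sleep` is injectable so the pauses can be skipped in tests.
    """
    delays = human_delays(max(len(keywords) - 1, 0), low, high, seed)
    results = []
    for index, keyword in enumerate(keywords):
        if index > 0:
            sleep(delays[index - 1])  # no pause before the first keyword
        results.append(action(keyword))
    return results


if __name__ == "__main__":
    # Stand-in action; short pauses so the demo finishes quickly.
    print(run_with_pauses(["selenium", "webdriver"], str.upper, low=0.1, high=0.3))
```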


EDIT 2016/04:

Apparently it is possible: https://stackoverflow.com/a/33403473/2930045 states that a company can do it. My guess, and it is nothing but a guess, is that they can run some JS that picks up on what Selenium installs into the browser in order to operate.
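One concrete script-visible signal, not necessarily the one the linked answer refers to, is the `navigator.webdriver` property defined by the W3C WebDriver specification, which browsers are expected to set to true while under WebDriver control. The helper below is a minimal sketch for checking what your own session exposes; `reports_webdriver` and `FakeDriver` are hypothetical names, and the only Selenium call assumed is `driver.execute_script`.

```python
# JavaScript run inside the page; a site's own scripts could read the same property.
PROBE_SCRIPT = "return navigator.webdriver === true;"


def reports_webdriver(driver):
    """Return True if the controlled browser exposes navigator.webdriver as true.

    `driver` is expected to be a Selenium WebDriver instance, or any object
    with a compatible execute_script(script) method.
    """
    return bool(driver.execute_script(PROBE_SCRIPT))


class FakeDriver:
    """Stand-in used to exercise the helper without launching a browser."""

    def __init__(self, flag):
        self.flag = flag
        self.scripts = []

    def execute_script(self, script):
        self.scripts.append(script)
        return self.flag


if __name__ == "__main__":
    print(reports_webdriver(FakeDriver(True)))
    print(reports_webdriver(FakeDriver(None)))
```

With a real session you would pass the actual driver object instead of `FakeDriver`. A false result only rules out this one signal, not other fingerprints.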

Petr Janeček answered Oct 11 '22 17:10