 

Reliably detecting PhantomJS-based spam bots


Is there any way to consistently detect PhantomJS/CasperJS? I've been dealing with a spate of malicious spambots built with it and have been able to mostly block them based on certain behaviours, but I'm curious if there's a rock-solid way to know if CasperJS is in use, as dealing with constant adaptations gets slightly annoying.

I don't believe in using Captchas. They are a negative user experience and ReCaptcha has never worked to block spam on my MediaWiki installations. As our site has no user registrations (anonymous discussion board), we'd need to have a Captcha entry for every post. We get several thousand legitimate posts a day and a Captcha would see that number divebomb.

asked Dec 31 '13 20:12 by Terrakin


People also ask

What is PhantomJS browser?

PhantomJS is a headless browser, meaning a web browser without a graphical user interface, used for automating web page interaction. It is based on WebKit, the web browser engine, so it can load and render web pages but does not display them on screen, while still behaving like a web browser.


2 Answers

I very much share your take on CAPTCHA. I'll list what I have been able to detect so far, for my own detection script, with similar goals. It's only partial, as there are many more headless browsers.

Fairly safe to use exposed window properties to detect/assume these particular headless browsers:

    window._phantom (or window.callPhantom)   // PhantomJS
    window.__phantomas                        // PhantomJS-based web perf metrics + monitoring tool
    window.Buffer                             // Node.js
    window.emit                               // CouchJS
    window.spawn                              // Rhino

The above was gathered from the JSLint docs and from testing with PhantomJS.
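For illustration, here is a minimal sketch that folds the properties above into one check; the property names come from the list above, while the function name and the idea of merely flagging (rather than blocking) are illustrative assumptions:

    // Minimal sketch: flag a visitor if any of the exposed properties listed
    // above are present on window. Property names are from the list above;
    // everything else is illustrative.
    function hasHeadlessMarkers(win) {
      var markers = [
        '_phantom', 'callPhantom', // PhantomJS
        '__phantomas',             // phantomas
        'Buffer',                  // Node.js shells
        'emit',                    // CouchJS
        'spawn'                    // Rhino
      ];
      return markers.some(function (name) { return name in win; });
    }

    if (hasHeadlessMarkers(window)) {
      // e.g. tag the session for server-side review instead of blocking outright
    }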

Browser automation drivers (used by BrowserStack or other web capture services for snapshots):

    window.webdriver                                            // Selenium
    window.domAutomation (or window.domAutomationController)    // Chromium-based automation driver

The properties are not always exposed, and I am looking into other, more robust ways to detect such bots, which I'll probably release as a full-blown script when done. But that mainly answers your question.
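As a companion sketch, the automation-driver properties can be folded into the same kind of check; window.webdriver and window.domAutomation(Controller) are the names from above, and navigator.webdriver (a separate, later standardized flag) is included only as an extra hint:

    // Sketch of a check for automation-driver markers. The window.* names are
    // the ones listed above; navigator.webdriver is a standard flag added as
    // an additional hint.
    function hasAutomationMarkers(win) {
      return !!(win.webdriver ||
                win.domAutomation ||
                win.domAutomationController ||
                (win.navigator && win.navigator.webdriver));
    }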

Here is another fairly sound method to detect JS-capable headless browsers more broadly:

    if (window.outerWidth === 0 && window.outerHeight === 0) {
        // headless browser
    }

This should work well because these properties default to 0 even if a headless browser sets a virtual viewport size: it cannot report the size of a browser window that doesn't exist. In particular, PhantomJS doesn't support outerWidth or outerHeight.

ADDENDUM: There is, however, a Chrome/Blink bug with the outer/inner dimensions. Chromium does not report those dimensions when a page loads in a hidden tab, such as one restored from a previous session. Safari doesn't seem to have that issue.

Update: It turns out iOS Safari 8+ has a bug where outerWidth & outerHeight are 0, and a Sailfish webview can report 0 too. So while it's a signal, it can't be used alone without being mindful of these bugs. Hence the warning: please don't use this snippet raw unless you really know what you are doing.
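Given those caveats, one hedged way to use it is as a weak signal combined with the property checks above, acting only above some threshold; the weights and threshold here are arbitrary illustrations:

    // Sketch: treat zero outer dimensions as a weak signal only, because of
    // the iOS Safari / Sailfish / hidden-tab caveats above. Weights and the
    // threshold are arbitrary.
    function headlessScore(win) {
      var score = 0;
      if (win.outerWidth === 0 && win.outerHeight === 0) score += 1;      // weak signal (see caveats)
      if (win._phantom || win.callPhantom || win.__phantomas) score += 2; // exposed headless properties
      if (win.webdriver || win.domAutomation || win.domAutomationController) score += 2; // automation drivers
      return score; // act (log, challenge) only above some threshold, e.g. score >= 2
    }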

PS: If you know of other headless browser properties not listed here, please share in comments.

answered Sep 22 '22 10:09 by hexalys


There is no rock-solid way: PhantomJS and Selenium are just software controlling browser software, instead of a user controlling it.

With PhantomJS 1.x in particular, I believe there is some JavaScript you can use to crash the browser by exploiting a bug in the version of WebKit it uses (it is equivalent to Chrome 13, so very few genuine users should be affected). (I remember this being mentioned on the PhantomJS mailing list a few months back, but I don't know if the exact JS to use was described.) More generally, you could combine user-agent matching with feature detection. E.g. if a browser claims to be "Chrome 23" but does not have a feature that Chrome 23 has (and that Chrome 13 did not have), get suspicious.
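As an illustration of that user-agent-versus-feature idea (not a specific recommended check), one could test whether a browser claiming a modern Chrome version is missing an API from that era; the feature chosen here, the Blob constructor, is an assumption picked because it postdates Chrome 13:

    // Illustrative sketch of a UA-vs-feature mismatch check. The chosen
    // feature (the Blob constructor) is an assumption: it is a post-Chrome-13
    // API, so a UA claiming a much newer Chrome without it looks suspicious.
    function uaFeatureMismatch() {
      var claimsModernChrome = /Chrome\/(2\d|[3-9]\d|\d{3,})\./.test(navigator.userAgent);
      var hasBlobConstructor = false;
      try {
        hasBlobConstructor = typeof Blob === 'function' && !!new Blob();
      } catch (e) {
        // old WebKit builds either lack Blob or throw on construction
      }
      return claimsModernChrome && !hasBlobConstructor;
    }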

As a user, I hate CAPTCHAs too. But they are quite effective in that they increase the cost for the spammer: he has to write more software or hire humans to read them. (That is why I think easy CAPTCHAs are good enough: the ones that annoy users are those where you have no idea what it says and have to keep pressing reload to get something you recognize.)

One approach (which I believe Google uses) is to show the CAPTCHA conditionally. E.g. users who are logged-in never get shown it. Users who have already done one post this session are not shown it again. Users from IP addresses in a whitelist (which could be built from previous legitimate posts) are not shown them. Or conversely just show them to users from a blacklist of IP ranges.
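A rough sketch of that conditional logic might look like the following; the session fields and the IP allow/deny lists are hypothetical placeholders you would adapt to whatever your stack provides:

    // Sketch of conditional CAPTCHA display. All names (session fields,
    // allowList/denyList sets) are hypothetical placeholders.
    function shouldShowCaptcha(session, ip, allowList, denyList) {
      if (session.loggedIn) return false;              // logged-in users: never
      if (session.postsThisSession > 0) return false;  // already passed one this session
      if (allowList.has(ip)) return false;             // IP seen on earlier legitimate posts
      if (denyList.has(ip)) return true;               // known-bad IP ranges: always
      return true;                                     // default for a first anonymous post
    }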

I know none of those approaches are perfect, sorry.

answered Sep 19 '22 10:09 by Darren Cook