
How to detect browser spoofing and robots from a user agent string in PHP

So far I am able to detect robots by matching user agent strings against a list of known robot user agents, but I was wondering what other methods there are to do this in PHP, as I am catching fewer bots than expected with this approach.
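
For reference, a minimal sketch of the keyword-matching approach described above (the keyword list and function name are illustrative, not exhaustive):

    <?php
    // Rough keyword match against the user agent string.
    // The keyword list here is only an example and is far from complete.
    function looks_like_bot($userAgent) {
        $botKeywords = array('bot', 'crawler', 'spider', 'slurp', 'curl', 'wget', 'python-requests');
        $ua = strtolower($userAgent);
        foreach ($botKeywords as $keyword) {
            if (strpos($ua, $keyword) !== false) {
                return true;
            }
        }
        // An empty user agent is also a strong hint of automation.
        return trim($ua) === '';
    }

    var_dump(looks_like_bot('Googlebot/2.1 (+http://www.google.com/bot.html)')); // bool(true)
    var_dump(looks_like_bot('Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.4) Firefox/3.6.23')); // bool(false)
    ?>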

I am also looking to find out how to detect whether a browser or robot is spoofing another browser based on its user agent string.

Any advice is appreciated.

EDIT: This has to be done using a log file with lines as follows:

129.173.129.168 - - [11/Oct/2011:00:00:05 -0300] "GET /cams/uni_ave2.jpg?time=1318302291289 HTTP/1.1" 200 20240 "http://faculty.dentistry.dal.ca/loanertracker/webcam.html" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.4; en-US; rv:1.9.2.23) Gecko/20110920 Firefox/3.6.23"

This means I can't check user behaviour aside from access times.
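
Since the data comes from a log file, a first step is simply pulling the fields apart. Below is a minimal sketch for parsing lines in the combined log format shown above; the regex, field names and the "access.log" filename are illustrative and assume the standard Apache "combined" layout:

    <?php
    // Extract the IP, timestamp, request, status, referer and user agent
    // from one combined-log-format line.
    function parse_log_line($line) {
        $pattern = '/^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$/';
        if (!preg_match($pattern, $line, $m)) {
            return null; // line does not match the expected format
        }
        return array(
            'ip'         => $m[1],
            'time'       => $m[2],
            'request'    => $m[3],
            'status'     => (int) $m[4],
            'bytes'      => $m[5],
            'referer'    => $m[6],
            'user_agent' => $m[7],
        );
    }

    // Example: list the user agent seen for every request in the log.
    foreach (file('access.log', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        $entry = parse_log_line($line);
        if ($entry !== null) {
            echo $entry['ip'] . ' => ' . $entry['user_agent'] . "\n";
        }
    }
    ?>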

Asked Nov 14 '12 by user1422508

2 Answers

In addition to filtering keywords in the user agent string, I have had luck with putting a hidden honeypot link on all pages:

<a style="display:none" href="autocatch.php">A</a>

Then in "autocatch.php" record the session (or IP address) as a bot. This link is invisible to users but it's hidden characteristic would hopefully not be realized by bots. Taking the style attribute out and putting it into a CSS file might help even more.

Answered by laifukang

Because, as previously stated, user agents and IP addresses can be spoofed, neither can be used on its own for reliable bot detection.

I work for a security company and our bot detection algorithm looks something like this:

  1. Step 1 - Gathering data:

    a. Cross-check user agent vs. IP (both need to match; see the reverse-DNS sketch after this list)

    b. Check header parameters (what is missing, what order they appear in, etc.)

    c. Check behavior (early access to and compliance with robots.txt, general behavior patterns, number of pages visited, visit rates, etc.)

  2. Step 2 - Classification:

    By cross-verifying this data, the bot is classified as "Good", "Bad" or "Suspicious".

  3. Step 3 - Active Challenges:

    Suspicious bots undergo the following challenges:

    a. JS challenge (can it execute JavaScript?)

    b. Cookie challenge (can it accept cookies?)

    c. If still not conclusive -> CAPTCHA
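
As a rough illustration of step 1a, here is a PHP sketch of verifying a client that claims to be Googlebot by doing a reverse DNS lookup followed by a forward lookup (the verification method Google itself documents for its crawler). The host suffixes and function names are illustrative, and the same check works offline on logged IPs:

    <?php
    // If the user agent says "Googlebot" but the IP does not resolve to a
    // Google crawler hostname, the user agent is very likely spoofed.
    function claims_to_be_googlebot($userAgent) {
        return stripos($userAgent, 'Googlebot') !== false;
    }

    function ip_really_belongs_to_google($ip) {
        $host = gethostbyaddr($ip);                  // reverse DNS lookup
        if ($host === false || $host === $ip) {
            return false;                            // no PTR record
        }
        $isGoogleHost = (bool) preg_match('/\.(googlebot\.com|google\.com)$/i', $host);
        // Forward-confirm: the hostname must resolve back to the original IP.
        return $isGoogleHost && gethostbyname($host) === $ip;
    }

    $ua = $_SERVER['HTTP_USER_AGENT'];
    $ip = $_SERVER['REMOTE_ADDR'];
    if (claims_to_be_googlebot($ua) && !ip_really_belongs_to_google($ip)) {
        // user agent claims Googlebot but the IP does not check out -> likely spoofed
    }
    ?>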

This filtering mechanism is VERY effective, but I don't think it could be replicated by a single person or even an unspecialized provider (for one thing, the challenges and the bot database need to be constantly updated by a security team).

We offer a "do it yourself" tool in the form of Botopedia.org, our directory, which can be used for IP/user-agent cross-verification, but for a truly effective solution you will have to rely on specialized services.

There are several free bot monitoring solutions, including our own, and most use the same strategy I've described above (or something similar).

GL

Answered by Igal Zeifman