
Regexp that matches user-agents of end-user browsers but NOT crawlers with >90 % accuracy

I'm trying to construct a regexp that evaluates to true for the User-Agent strings of "browsers navigated by humans" but false for bots. Needless to say, the matching will not be exact, but if it gets things right in, say, 90 % of cases, that is more than good enough.

My approach so far is to target the User-Agent strings of the five major desktop browsers (MSIE, Firefox, Chrome, Safari, Opera). Specifically, I want the regexp NOT to match if the user-agent is a bot (Googlebot, msnbot, etc.).

Currently I'm using the following regexp which appears to achieve the desired precision:

^(Mozilla.*(Gecko|KHTML|MSIE|Presto|Trident)|Opera).*$

I've observed a small number of false negatives, which are mostly mobile browsers. The exceptions all match:

(BlackBerry|HTC|LG|MOT|Nokia|NOKIAN|PLAYSTATION|PSP|SAMSUNG|SonyEricsson)
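For readers who want to try the two patterns together, here is a minimal Python sketch. Both regexps are copied from the question; the helper name `looks_human` and the sample User-Agent strings are illustrative assumptions, and `ExampleBot` is a made-up crawler used only to show the kind of false positive the desktop pattern allows.

```python
import re

# Desktop pattern, copied verbatim from the question.
DESKTOP_RE = re.compile(r"^(Mozilla.*(Gecko|KHTML|MSIE|Presto|Trident)|Opera).*$")

# Mobile exceptions, copied verbatim from the question.
MOBILE_RE = re.compile(
    r"(BlackBerry|HTC|LG|MOT|Nokia|NOKIAN|PLAYSTATION|PSP|SAMSUNG|SonyEricsson)"
)


def looks_human(user_agent: str) -> bool:
    """Return True if either pattern matches (hypothetical helper)."""
    return bool(DESKTOP_RE.search(user_agent) or MOBILE_RE.search(user_agent))


# Illustrative samples; real traffic will contain many more variants.
SAMPLES = [
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.5.22 Version/10.50",
    "BlackBerry9000/4.6.0.167 Profile/MIDP-2.0 Configuration/CLDC-1.1",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "msnbot/2.0b (+http://search.msn.com/msnbot.htm)",
    # Made-up bot that advertises an engine token: matches the desktop pattern.
    "Mozilla/5.0 (compatible; ExampleBot/1.0) Gecko/20100101",
]

for ua in SAMPLES:
    print(looks_human(ua), ua)
```

Both patterns are case-sensitive, and short tokens such as `LG` or `MOT` can match anywhere in the string, so the mobile pattern may be worth anchoring or tightening before relying on it.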

My question is: given the desired accuracy level, how would you improve the regexp? Can you think of any major false positives or false negatives for the given regexp?

Please note that the question is specifically about regexp-based User-Agent matching. There are a bunch of other approaches to solving this problem, but those are out of the scope of this question.

knorv asked Mar 24 '10 14:03


1 Answer

You could construct a blacklist by checking which user agents access robots.txt.
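One way this could be sketched in Python, assuming the server writes Apache/Nginx "combined" format access logs (an assumption; the answer does not specify a log format). The function name `robots_txt_agents`, the log regexp, and the sample log lines are all illustrative, not part of the original answer.

```python
import re

# Assumption: "combined" log format, i.e.
# host ident user [time] "METHOD path proto" status bytes "referer" "user-agent"
LOG_LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def robots_txt_agents(log_lines):
    """Collect the User-Agent strings that requested /robots.txt."""
    agents = set()
    for line in log_lines:
        match = LOG_LINE_RE.search(line)
        if match is None:
            continue  # Line is not in the assumed format; skip it.
        path = match.group("path").split("?", 1)[0]  # Ignore any query string.
        if path == "/robots.txt":
            agents.add(match.group("agent"))
    return agents


# Made-up log lines using documentation-range IP addresses.
SAMPLE_LOG = [
    '192.0.2.1 - - [24/Mar/2010:14:03:00 +0000] "GET /robots.txt HTTP/1.1" '
    '200 68 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.10 - - [24/Mar/2010:14:03:05 +0000] "GET /index.html HTTP/1.1" '
    '200 5120 "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)"',
]

blacklist = robots_txt_agents(SAMPLE_LOG)
print(blacklist)
```

The resulting set can then be checked before (or instead of) the regexp. A human occasionally opens robots.txt in a browser, so entries seen only once may deserve a manual look before being blacklisted.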

Sjoerd answered Oct 12 '22 06:10