Regexp that matches user-agents of end-user browsers but NOT crawlers with 90 % accuracy

Question

I'm trying to construct a regexp that evaluates to true for the User-Agent strings of "browsers navigated by humans", but false for bots. Needless to say, the matching will not be exact, but if it gets things right in, say, 90 % of cases, that is more than good enough.

My approach so far is to target the User-Agent strings of the five major desktop browsers (MSIE, Firefox, Chrome, Safari, Opera). Specifically, I want the regexp NOT to match if the user-agent is a bot (Googlebot, msnbot, etc.).

Currently I'm using the following regexp which appears to achieve the desired precision:

^(Mozilla.*(Gecko|KHTML|MSIE|Presto|Trident)|Opera).*$
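As a rough illustration, the pattern above can be exercised in Python along these lines. The regexp is copied verbatim from the question; the sample User-Agent strings and the helper name `looks_like_browser` are assumptions made for this sketch, not data from the original post.

```python
import re

# Pattern copied verbatim from the question.
BROWSER_RE = re.compile(r"^(Mozilla.*(Gecko|KHTML|MSIE|Presto|Trident)|Opera).*$")

# Illustrative User-Agent strings (assumed samples, not taken from the question).
SAMPLES = {
    "firefox": "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0",
    "msie": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)",
    "opera": "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.10.229 Version/11.62",
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "msnbot": "msnbot/2.0b (+http://search.msn.com/msnbot.htm)",
}


def looks_like_browser(user_agent: str) -> bool:
    """Return True if the User-Agent matches the desktop-browser pattern."""
    return BROWSER_RE.match(user_agent) is not None


if __name__ == "__main__":
    for name, ua in SAMPLES.items():
        print(name, looks_like_browser(ua))
```

Note that the match is case-sensitive and relies on rendering-engine tokens, so any bot that advertises one of those tokens in its User-Agent would pass as a browser.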

I've observed a small number of false negatives, which are mostly mobile browsers. The exceptions all match:

(BlackBerry|HTC|LG|MOT|Nokia|NOKIAN|PLAYSTATION|PSP|SAMSUNG|SonyEricsson)
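One possible way to fold the two patterns together is sketched below: accept a User-Agent if it matches the desktop pattern from the start, or if it contains one of the mobile tokens anywhere. Both regexps are taken from the question; the function name `is_probably_human` and the sample strings are assumptions for illustration only.

```python
import re

# Both patterns are copied verbatim from the question.
DESKTOP_RE = re.compile(r"^(Mozilla.*(Gecko|KHTML|MSIE|Presto|Trident)|Opera).*$")
MOBILE_RE = re.compile(
    r"(BlackBerry|HTC|LG|MOT|Nokia|NOKIAN|PLAYSTATION|PSP|SAMSUNG|SonyEricsson)"
)


def is_probably_human(user_agent: str) -> bool:
    """Desktop pattern anchored at the start, mobile tokens searched anywhere."""
    return bool(DESKTOP_RE.match(user_agent) or MOBILE_RE.search(user_agent))


# Illustrative samples (assumed, not from the question).
MOBILE_UA = "BlackBerry9700/5.0.0.862 Profile/MIDP-2.1 Configuration/CLDC-1.1 VendorID/331"
BOT_UA = "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"

if __name__ == "__main__":
    print(is_probably_human(MOBILE_UA))
    print(is_probably_human(BOT_UA))
```

Because the mobile pattern is unanchored and case-sensitive, short tokens such as `LG`, `MOT`, and `PSP` could also match inside unrelated strings, which may be worth checking against real traffic.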

My question is: given the desired accuracy level, how would you improve the regexp? Can you think of any major false positives or false negatives for the given regexp?

Please note that the question is specifically about regexp-based User-Agent matching. There are a bunch of other approaches to solving this problem, but those are out of the scope of this question.

Sjoerd · Accepted Answer

You could construct a blacklist by checking which user agents access robots.txt.
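A minimal sketch of that idea, assuming the server writes access logs in the common "combined" format (an assumption; the answer does not specify a log format). The function names, the log lines, and the `ExampleBot` User-Agent are hypothetical and only serve to illustrate the approach.

```python
import re

# Browser pattern copied verbatim from the question.
BROWSER_RE = re.compile(r"^(Mozilla.*(Gecko|KHTML|MSIE|Presto|Trident)|Opera).*$")

# Assumed "combined" log format:
# host ident user [time] "METHOD path proto" status bytes "referer" "user-agent"
LOG_LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def robots_txt_user_agents(log_lines):
    """Collect every User-Agent that requested /robots.txt."""
    blacklist = set()
    for line in log_lines:
        m = LOG_LINE_RE.search(line)
        # Ignore any query string when comparing the requested path.
        if m and m.group("path").split("?", 1)[0] == "/robots.txt":
            blacklist.add(m.group("ua"))
    return blacklist


def is_probably_human(user_agent, blacklist):
    """Regexp match, minus anything seen fetching robots.txt."""
    return user_agent not in blacklist and BROWSER_RE.match(user_agent) is not None


# Hypothetical log lines; ExampleBot is a made-up crawler with a browser-like UA.
EXAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2012:13:55:36 +0000] "GET /robots.txt HTTP/1.1" 200 68 "-" '
    '"Mozilla/5.0 (compatible; ExampleBot/1.0) Gecko/20100101"',
    '198.51.100.7 - - [10/Oct/2012:13:56:01 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0"',
]

if __name__ == "__main__":
    seen = robots_txt_user_agents(EXAMPLE_LOG)
    for ua in sorted(seen):
        print("blacklisted:", ua)
```

The blacklist complements the regexp: a crawler whose User-Agent imitates a browser would pass the pattern, but would still be excluded once it has been seen fetching robots.txt.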

Tags: browser, regex, user-agent

knorv
