I'm trying to construct a regexp that will evaluate to true for User-Agent strings of "browsers navigated by humans", but false for bots. Needless to say, the matching will not be exact, but if it gets things right in, say, 90% of cases, that is more than good enough.
My approach so far is to target the User-Agent string of the five major desktop browsers (MSIE, Firefox, Chrome, Safari, Opera). Specifically, I want the regexp NOT to match if the User-Agent is a bot (Googlebot, msnbot, etc.).
Currently I'm using the following regexp, which appears to achieve the desired precision:
^(Mozilla.*(Gecko|KHTML|MSIE|Presto|Trident)|Opera).*$
I've observed a small number of false negatives, which are mostly mobile browsers. The exceptions all match:
(BlackBerry|HTC|LG|MOT|Nokia|NOKIAN|PLAYSTATION|PSP|SAMSUNG|SonyEricsson)
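For illustration, here is a minimal sketch of how the two patterns above might be combined, assuming Python's `re` module. The function name `looks_human` and the sample User-Agent strings are hypothetical and only there to exercise the patterns; they are not taken from real traffic logs.

```python
import re

# The desktop-browser pattern from the question, unchanged.
BROWSER_RE = re.compile(r"^(Mozilla.*(Gecko|KHTML|MSIE|Presto|Trident)|Opera).*$")

# The mobile exception pattern from the question; it is unanchored,
# so it is applied with search() rather than match().
MOBILE_RE = re.compile(
    r"(BlackBerry|HTC|LG|MOT|Nokia|NOKIAN|PLAYSTATION|PSP|SAMSUNG|SonyEricsson)"
)


def looks_human(user_agent):
    """Return True if either pattern matches the User-Agent string."""
    return bool(BROWSER_RE.match(user_agent) or MOBILE_RE.search(user_agent))


# Hypothetical sample strings, chosen only for illustration.
SAMPLES = {
    "firefox": "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0",
    "opera": "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.10.229 Version/11.61",
    "nokia": "Nokia6300/2.0 (05.00) Profile/MIDP-2.0 Configuration/CLDC-1.1",
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "msnbot": "msnbot/2.0b (+http://search.msn.com/msnbot.htm)",
}

for name, ua in SAMPLES.items():
    print(name, looks_human(ua))
```

Note that the mobile pattern is case-sensitive and contains short tokens such as `LG` and `MOT`, so it may match unrelated strings; treating it as a fallback after the main pattern keeps that risk contained.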
My question is: Given the desired accuracy level, how would you improve the regexp? Can you think of any major false positives or false negatives to the given regexp?
Please note that the question is specifically about regexp-based User-Agent matching. There are a bunch of other approaches to solving this problem, but those are out of the scope of this question.
You could construct a blacklist by checking which user agents access robots.txt.
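A rough sketch of that idea, assuming the server writes access logs in the common "combined" format (an assumption; adjust the pattern to your own log layout). The function name `robots_txt_agents` and the sample log lines are hypothetical.

```python
import re

# Assumes the "combined" access log format:
# host ident user [time] "METHOD path proto" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def robots_txt_agents(lines):
    """Collect the User-Agent strings of clients that requested /robots.txt."""
    agents = set()
    for line in lines:
        m = LOG_LINE.search(line)
        # Ignore any query string when comparing the requested path.
        if m and m.group("path").split("?", 1)[0] == "/robots.txt":
            agents.add(m.group("ua"))
    return agents


# Hypothetical log lines, for illustration only.
SAMPLE_LOG = [
    '66.249.66.1 - - [10/Oct/2012:13:55:36 +0000] "GET /robots.txt HTTP/1.1" 200 68 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [10/Oct/2012:13:56:01 +0000] "GET /index.html HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0"',
]

blacklist = robots_txt_agents(SAMPLE_LOG)
print(blacklist)
```

The resulting set could then be checked before applying the regexp, so that any agent seen fetching robots.txt is treated as a bot regardless of what its string looks like.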