I would like to detect (on the server side) which requests are from bots. I don't care about malicious bots at this point, just the ones that are playing nice. I've seen a few approaches that mostly involve matching the user agent string against keywords like 'bot'. But that seems awkward, incomplete, and unmaintainable. So does anyone have any more solid approaches? If not, do you have any resources you use to keep up to date with all the friendly user agents?
If you're curious: I'm not trying to do anything against any search engine policy. We have a section of the site where a user is randomly presented with one of several slightly different versions of a page. However, if a web crawler is detected, we'd always serve it the same version so that the index is consistent.
Also I'm using Java, but I would imagine the approach would be similar for any server-side technology.
If you are able to identify requests that originate from the crawler's IP range, you are set. There are two common methods of verifying the IP: some search engines publish lists or ranges of their crawler IPs, so you can verify a crawler by matching its IP against the published list; alternatively, Google and Bing both document verifying their crawlers with a reverse DNS lookup on the requesting IP, followed by a forward lookup on the returned hostname to confirm it resolves back to the same IP.
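A minimal Java sketch of that reverse-then-forward DNS check, using the googlebot.com / google.com suffixes that Google documents for Googlebot (lookup failures are simply treated as "not verified"):

import java.net.InetAddress;

public class CrawlerDnsCheck {
    public static boolean isVerifiedGooglebot(String remoteIp) {
        try {
            // reverse DNS lookup on the requesting IP
            String host = InetAddress.getByName(remoteIp).getCanonicalHostName();
            if (!host.endsWith(".googlebot.com") && !host.endsWith(".google.com")) {
                return false; // reverse lookup failed or points at some other domain
            }
            // forward lookup on the returned hostname must resolve back to the same IP
            for (InetAddress forward : InetAddress.getAllByName(host)) {
                if (forward.getHostAddress().equals(remoteIp)) {
                    return true;
                }
            }
        } catch (Exception e) {
            // DNS errors: treat as not verified
        }
        return false;
    }
}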
Google's main crawler is called Googlebot.
Because it is not possible to know how many total webpages there are on the Internet, web crawler bots start from a seed, or a list of known URLs. They crawl the webpages at those URLs first. As they crawl those webpages, they will find hyperlinks to other URLs, and they add those to the list of pages to crawl next.
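For illustration only, here is a toy Java crawl loop showing that seed-and-frontier idea; the seed URL and the 50-page cap are arbitrary choices for the sketch, not anything a real crawler would use:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ToyCrawler {
    private static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");

    public static void main(String[] args) throws Exception {
        Deque<String> frontier = new ArrayDeque<>(List.of("https://example.com/")); // the seed list
        Set<String> seen = new HashSet<>(frontier);
        HttpClient client = HttpClient.newHttpClient();

        while (!frontier.isEmpty() && seen.size() < 50) { // stop after 50 discovered URLs
            String url = frontier.poll();
            HttpResponse<String> page = client.send(
                    HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofString());
            Matcher m = HREF.matcher(page.body());
            while (m.find()) { // add newly discovered links to the frontier
                String link = m.group(1);
                if (seen.add(link)) {
                    frontier.add(link);
                }
            }
        }
        System.out.println("Discovered " + seen.size() + " URLs");
    }
}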
You said matching the user agent on 'bot' may be awkward, but we've found it to be a pretty good match. Our studies have shown that it covers about 98% of the hits you receive, and we haven't come across any false positives yet. If you want to raise this to 99.9%, you can include a few other well-known matches such as 'crawler', 'baiduspider', 'ia_archiver', 'curl', etc. We've tested this on our production systems over millions of hits.
Here are a few C# solutions for you:
Option 1: the fastest when processing a miss, i.e. traffic from a non-bot (a normal user). Catches 99+% of crawlers.
// the "?? """ guards against requests that send no User-Agent header
bool iscrawler = Regex.IsMatch(Request.UserAgent ?? "", @"bot|crawler|baiduspider|80legs|ia_archiver|voyager|curl|wget|yahoo! slurp|mediapartners-google", RegexOptions.IgnoreCase);
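Since the question mentions Java, here is a rough, untested port of option 1. The keyword list simply mirrors the regex above; userAgent would come from request.getHeader("User-Agent") in a servlet or filter.

import java.util.regex.Pattern;

public class CrawlerCheck {
    // same keywords as the C# regex above
    private static final Pattern CRAWLER_PATTERN = Pattern.compile(
        "bot|crawler|baiduspider|80legs|ia_archiver|voyager|curl|wget|yahoo! slurp|mediapartners-google",
        Pattern.CASE_INSENSITIVE);

    public static boolean isCrawler(String userAgent) {
        // requests without a User-Agent header are treated as non-crawlers
        return userAgent != null && CRAWLER_PATTERN.matcher(userAgent).find();
    }
}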
Option 2: the fastest when processing a hit, i.e. traffic from a bot, and pretty fast for misses too. Catches close to 100% of crawlers. Matches 'bot', 'crawler' and 'spider' up front; you can add any other known crawlers to the list.
List<string> Crawlers3 = new List<string>()
{
    "bot","crawler","spider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google",
    "lwp-trivial","nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne",
    "atn_worldwide","atomz","bjaaland","ukonline","calif","combine","cosmos","cusco",
    "cyberspyder","digger","grabber","downloadexpress","ecollector","ebiness","esculapio",
    "esther","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
    "gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","havindex","hotwired",
    "htdig","ingrid","informant","inspectorwww","iron33","teoma","ask jeeves","jeeves",
    "image.kapsi.net","kdd-explorer","label-grabber","larbin","linkidator","linkwalker",
    "lockon","marvin","mattie","mediafox","merzscope","nec-meshexplorer","udmsearch","moget",
    "motor","muncher","muninn","muscatferret","mwdsearch","sharp-info-agent","webmechanic",
    "netscoop","newscan-online","objectssearch","orbsearch","packrat","pageboy","parasite",
    "patric","pegasus","phpdig","piltdownman","pimptrain","plumtreewebaccessor","getterrobo-plus",
    "raven","roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au",
    "searchprocess","senrigan","shagseeker","site valet","skymob","slurp","snooper","speedy",
    "curl_image_client","suke","www.sygol.com","tach_bw","templeton","titin","topiclink","udmsearch",
    "urlck","valkyrie libwww-perl","verticrawl","victoria","webscout","voyager","crawlpaper",
    "webcatcher","t-h-u-n-d-e-r-s-t-o-n-e","webmoose","pagesinventory","webquest","webreaper",
    "webwalker","winona","occam","robi","fdse","jobo","rhcs","gazz","dwcp","yeti","fido","wlm",
    "wolp","wwwc","xget","legs","curl","webs","wget","sift","cmc"
};

// guard against requests that send no User-Agent header
string ua = (Request.UserAgent ?? "").ToLower();
bool iscrawler = Crawlers3.Exists(x => ua.Contains(x));
Option 3: pretty fast, but a little slower than options 1 and 2. It's the most accurate, and lets you maintain the lists if you want. You can maintain a separate list of names containing 'bot' if you are afraid of false positives in the future. If we get a short match, we log it and check it for a false positive.
// crawlers that have 'bot' in their useragent
List<string> Crawlers1 = new List<string>()
{
    "googlebot","bingbot","yandexbot","ahrefsbot","msnbot","linkedinbot","exabot","compspybot",
    "yesupbot","paperlibot","tweetmemebot","semrushbot","gigabot","voilabot","adsbot-google",
    "botlink","alkalinebot","araybot","undrip bot","borg-bot","boxseabot","yodaobot","admedia bot",
    "ezooms.bot","confuzzledbot","coolbot","internet cruiser robot","yolinkbot","diibot","musobot",
    "dragonbot","elfinbot","wikiobot","twitterbot","contextad bot","hambot","iajabot","news bot",
    "irobot","socialradarbot","ko_yappo_robot","skimbot","psbot","rixbot","seznambot","careerbot",
    "simbot","solbot","mail.ru_bot","spiderbot","blekkobot","bitlybot","techbot","void-bot",
    "vwbot_k","diffbot","friendfeedbot","archive.org_bot","woriobot","crystalsemanticsbot","wepbot",
    "spbot","tweetedtimes bot","mj12bot","who.is bot","psbot","robot","jbot","bbot","bot"
};

// crawlers that don't have 'bot' in their useragent
List<string> Crawlers2 = new List<string>()
{
    "baiduspider","80legs","baidu","yahoo! slurp","ia_archiver","mediapartners-google","lwp-trivial",
    "nederland.zoek","ahoy","anthill","appie","arale","araneo","ariadne","atn_worldwide","atomz",
    "bjaaland","ukonline","bspider","calif","christcrawler","combine","cosmos","cusco","cyberspyder",
    "cydralspider","digger","grabber","downloadexpress","ecollector","ebiness","esculapio","esther",
    "fastcrawler","felix ide","hamahakki","kit-fireball","fouineur","freecrawl","desertrealm",
    "gammaspider","gcreep","golem","griffon","gromit","gulliver","gulper","whowhere","portalbspider",
    "havindex","hotwired","htdig","ingrid","informant","infospiders","inspectorwww","iron33",
    "jcrawler","teoma","ask jeeves","jeeves","image.kapsi.net","kdd-explorer","label-grabber",
    "larbin","linkidator","linkwalker","lockon","logo_gif_crawler","marvin","mattie","mediafox",
    "merzscope","nec-meshexplorer","mindcrawler","udmsearch","moget","motor","muncher","muninn",
    "muscatferret","mwdsearch","sharp-info-agent","webmechanic","netscoop","newscan-online",
    "objectssearch","orbsearch","packrat","pageboy","parasite","patric","pegasus","perlcrawler",
    "phpdig","piltdownman","pimptrain","pjspider","plumtreewebaccessor","getterrobo-plus","raven",
    "roadrunner","robbie","robocrawl","robofox","webbandit","scooter","search-au","searchprocess",
    "senrigan","shagseeker","site valet","skymob","slcrawler","slurp","snooper","speedy",
    "spider_monkey","spiderline","curl_image_client","suke","www.sygol.com","tach_bw","templeton",
    "titin","topiclink","udmsearch","urlck","valkyrie libwww-perl","verticrawl","victoria",
    "webscout","voyager","crawlpaper","wapspider","webcatcher","t-h-u-n-d-e-r-s-t-o-n-e",
    "webmoose","pagesinventory","webquest","webreaper","webspider","webwalker","winona","occam",
    "robi","fdse","jobo","rhcs","gazz","dwcp","yeti","crawler","fido","wlm","wolp","wwwc","xget",
    "legs","curl","webs","wget","sift","cmc"
};

// guard against requests that send no User-Agent header
string ua = (Request.UserAgent ?? "").ToLower();
string match = null;

if (ua.Contains("bot")) match = Crawlers1.FirstOrDefault(x => ua.Contains(x));
else match = Crawlers2.FirstOrDefault(x => ua.Contains(x));

if (match != null && match.Length < 5) Log("Possible new crawler found: ", ua);

bool iscrawler = match != null;
Notes:
Honeypots
The only real alternative to this is to create a 'honeypot' link on your site that only a bot will reach. You then log the user agent strings that hit the honeypot page to a database, and use those logged strings to classify crawlers (a minimal logging endpoint is sketched after the pros and cons below).
Positives:
It will match some unknown crawlers that aren’t declaring themselves.
Negatives:
Not all crawlers dig deep enough to hit every link on your site, and so they may not reach your honeypot.
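A minimal sketch of such a honeypot endpoint, assuming the javax.servlet API; the "/hidden-honeypot" path is made up for the example, and a real setup would write the hits to a database rather than the servlet log:

import java.io.IOException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet("/hidden-honeypot") // link to this URL somewhere only a link-following bot will go
public class HoneypotServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        String userAgent = req.getHeader("User-Agent");
        String ip = req.getRemoteAddr();
        // a real implementation would insert this into a database for later classification
        getServletContext().log("Honeypot hit from " + ip + " with User-Agent: " + userAgent);
        resp.setStatus(HttpServletResponse.SC_NO_CONTENT);
    }
}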
You can find a very thorough database of known "good" web crawlers in the robotstxt.org Robots Database. Using this data would be far more effective than just matching 'bot' in the user agent.
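If you go that route, here is a hedged sketch of consuming such a list from Java, assuming you have exported the database to a plain text file (the known-crawlers.txt name and the one-substring-per-line format are assumptions for this example):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class KnownCrawlerList {
    private final List<String> substrings;

    public KnownCrawlerList(Path file) throws IOException {
        // one lower-cased user-agent substring per line, blank lines ignored
        substrings = Files.readAllLines(file).stream()
                .map(line -> line.trim().toLowerCase(Locale.ROOT))
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toList());
    }

    public boolean matches(String userAgent) {
        if (userAgent == null) {
            return false;
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        return substrings.stream().anyMatch(ua::contains);
    }
}

Usage would be along the lines of new KnownCrawlerList(Path.of("known-crawlers.txt")).matches(request.getHeader("User-Agent")).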