Google's Webmaster Guidelines state:
Allow search bots to crawl your sites without session IDs or arguments that track their path through the site. These techniques are useful for tracking individual user behavior, but the access pattern of bots is entirely different. Using these techniques may result in incomplete indexing of your site, as bots may not be able to eliminate URLs that look different but actually point to the same page.
My ASP.NET 1.1 site uses custom authentication/authorization and relies pretty heavily on session GUIDs (similar to this approach). I'm worried that allowing non-session-tracked traffic will either break my existing code or introduce security vulnerabilities.

What best practices are there for allowing non-session-tracked bots to crawl a normally session-tracked site? And are there any ways of detecting search bots other than inspecting the user agent (I don't want people spoofing themselves as Googlebot to get around my session tracking)?
The correct way to detect bots is by host entry (Dns.GetHostEntry). Some lame robots require you to track by IP address, but the popular ones generally don't: Googlebot requests come from *.googlebot.com. After you get the host entry, you should check the IPHostEntry.AddressList to make sure it contains the original IP address.
Do not even look at the user agent when verifying robots.
See also http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html
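A minimal sketch of that reverse-then-forward DNS check, assuming .NET 2.0 or later (Dns.GetHostEntry isn't available on 1.1); the class and method names are just illustrative:

    using System;
    using System.Net;
    using System.Net.Sockets;

    // Sketch of the reverse-then-forward DNS verification described above.
    // BotVerifier / IsVerifiedGooglebot are illustrative names, not from the answer.
    public static class BotVerifier
    {
        public static bool IsVerifiedGooglebot(IPAddress address)
        {
            try
            {
                // 1. Reverse lookup: a real Googlebot's PTR name ends in googlebot.com
                //    (some Google crawlers also resolve to google.com).
                string host = Dns.GetHostEntry(address).HostName;
                if (!host.EndsWith(".googlebot.com", StringComparison.OrdinalIgnoreCase) &&
                    !host.EndsWith(".google.com", StringComparison.OrdinalIgnoreCase))
                {
                    return false;
                }

                // 2. Forward lookup: the host name must resolve back to the original
                //    address, otherwise anyone controlling their own PTR records could spoof it.
                foreach (IPAddress candidate in Dns.GetHostEntry(host).AddressList)
                {
                    if (candidate.Equals(address))
                    {
                        return true;
                    }
                }
                return false;
            }
            catch (SocketException)
            {
                // DNS lookup failed: treat the caller as unverified.
                return false;
            }
        }
    }

You'd call it with the address parsed from Request.UserHostAddress and cache the verdict per IP, since doing two DNS lookups on every request would be expensive.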
First of all: we had some issues with simply stripping JSESSIONIDs from responses to known search engines. Most notably, creating a new session for each request caused OutOfMemoryErrors (you're not using Java, but keeping state for thousands of active sessions is certainly a problem for most servers/frameworks). This might be solved by reducing the session timeout for bot sessions only, if that's possible. So if you'd like to go down this path, be warned. And if you do, there's no need for DNS lookups: you aren't protecting anything valuable here (compared to Google's First Click Free, for instance), so if somebody pretends to be a bot, that should normally be fine.
Instead, I'd suggest keeping session tracking (using URL parameters as a fallback for cookies) and adding a canonical link tag (<link rel="canonical" href="..." />, obviously without the session id itself) to each page. See "Make Google Ignore JSESSIONID" or the extensive video featuring Matt Cutts for discussion. Adding this tag isn't very intrusive and could possibly be considered good practice anyway. So basically you'd end up without any dedicated handling of search engine spiders, which certainly is a Good Thing (tm).
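For what it's worth, here is a minimal sketch of emitting that canonical tag from a common base page. It assumes ASP.NET 2.0 or later (HtmlLink and Page.Header) and a <head runat="server">; the "sessionid" query key is only a placeholder for whatever key the guid tracking actually uses:

    using System;
    using System.Collections.Specialized;
    using System.Web;
    using System.Web.UI;
    using System.Web.UI.HtmlControls;

    // Illustrative base page that adds <link rel="canonical" href="..." />
    // pointing at the current URL with the session-tracking parameter removed.
    public class CanonicalPage : Page
    {
        protected override void OnLoad(EventArgs e)
        {
            base.OnLoad(e);

            if (Page.Header == null)
            {
                return; // requires <head runat="server">
            }

            // Rebuild the current URL without the tracking parameter.
            UriBuilder canonical = new UriBuilder(Request.Url);
            NameValueCollection query =
                HttpUtility.ParseQueryString(canonical.Query.TrimStart('?'));
            query.Remove("sessionid"); // placeholder key
            canonical.Query = query.ToString();

            // Emit the canonical link into the page head.
            HtmlLink link = new HtmlLink();
            link.Href = canonical.Uri.AbsoluteUri;
            link.Attributes["rel"] = "canonical";
            Page.Header.Controls.Add(link);
        }
    }

Doing this in one base class keeps the canonical URL consistent across pages without touching each .aspx individually.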