Allow search bots to crawl your sites without session IDs

Tags: security, asp.net, session, search-engine-bots

Google's Webmaster guidelines state:

Allow search bots to crawl your sites without session IDs or arguments that track their path through the site. These techniques are useful for tracking individual user behavior, but the access pattern of bots is entirely different. Using these techniques may result in incomplete indexing of your site, as bots may not be able to eliminate URLs that look different but actually point to the same page.

My ASP.NET 1.1 site uses custom authentication/authorization and relies pretty heavily on session GUIDs (similar to this approach). I'm worried that allowing non-session-tracked traffic will either break my existing code or introduce security vulnerabilities.

What best practices are there for allowing non-session-tracked bots to crawl a normally session-tracked site? And are there any ways of detecting search bots other than inspecting the user agent? (I don't want people to spoof themselves as Googlebot to get around my session tracking.)

asked Feb 04 '10 21:02

kenwarner

2 Answers

The correct way to detect bots is by host entry (Dns.GetHostEntry). Some lame robots require you to track them by IP address, but the popular ones generally don't. Googlebot requests come from *.googlebot.com. After you get the host entry, check that IPHostEntry.AddressList contains the original IP address; otherwise anyone who controls the reverse DNS for their own IP address could simply claim a googlebot.com host name.

Do not even look at the user agent when verifying robots.

See also http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html
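As a rough illustration of the check described above, here is a minimal C# sketch. The class and method names (GooglebotVerifier, IsConfirmed, IsGooglebot) are made up for this example, and the ".googlebot.com" suffix is taken from the answer; treat it as a starting point under those assumptions rather than a finished implementation.

```csharp
using System;
using System.Net;
using System.Net.Sockets;

// Hypothetical helper; names are illustrative, not from any library.
public static class GooglebotVerifier
{
    // Pure check: the reverse-resolved host name must end in .googlebot.com,
    // and the forward lookup of that name must contain the original address.
    public static bool IsConfirmed(IPAddress remote, string reverseHostName, IPAddress[] forwardAddresses)
    {
        if (remote == null || reverseHostName == null || forwardAddresses == null)
            return false;

        string host = reverseHostName.TrimEnd('.');
        if (!host.EndsWith(".googlebot.com", StringComparison.OrdinalIgnoreCase))
            return false;

        foreach (IPAddress candidate in forwardAddresses)
        {
            if (candidate.Equals(remote))
                return true;
        }
        return false;
    }

    // Performs the reverse and forward lookups over the network.
    public static bool IsGooglebot(string remoteIp)
    {
        IPAddress remote;
        if (!IPAddress.TryParse(remoteIp, out remote))
            return false;

        try
        {
            string host = Dns.GetHostEntry(remote).HostName;          // reverse lookup
            IPAddress[] forward = Dns.GetHostEntry(host).AddressList; // forward lookup
            return IsConfirmed(remote, host, forward);
        }
        catch (SocketException)
        {
            return false; // no reverse record, or the lookup failed
        }
        catch (ArgumentException)
        {
            return false; // address not acceptable to Dns.GetHostEntry
        }
    }
}
```

In an ASP.NET page this would be called with Request.UserHostAddress. DNS lookups are slow, so caching the result per IP address is probably worthwhile.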

answered Sep 27 '22 16:09

Brian

First of all: we had some issues with simply stripping JSESSIONIDs from responses to known search engines. Most notably, creating a new session for each request caused OutOfMemoryErrors (you're not using Java, but keeping state for thousands of active sessions is a problem for most servers and frameworks). This might be solved by reducing the session timeout (for bot sessions only, if possible). So if you'd like to go down this path, be warned. And if you do, there's no need for DNS lookups: you aren't protecting anything valuable here (compared to Google's First Click Free, for instance). If somebody pretends to be a bot, that should normally be fine.
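For the ASP.NET case, the "shorter timeout for bot sessions" idea could look roughly like the sketch below. Everything in it is an assumption for illustration: the class name, the token list (deliberately incomplete) and the timeout values. It sniffs the user agent, which fits the point above that nothing valuable is being protected.

```csharp
using System;

// Hypothetical policy class; names and values are illustrative only.
public static class BotSessionPolicy
{
    public const int BotTimeoutMinutes = 1;      // assumed value, tune for your site
    public const int DefaultTimeoutMinutes = 20; // assumed value for normal visitors

    // Illustrative, incomplete list of crawler user-agent tokens.
    private static readonly string[] BotTokens = { "googlebot", "bingbot", "slurp" };

    public static bool LooksLikeBot(string userAgent)
    {
        if (string.IsNullOrEmpty(userAgent))
            return false;

        foreach (string token in BotTokens)
        {
            if (userAgent.IndexOf(token, StringComparison.OrdinalIgnoreCase) >= 0)
                return true;
        }
        return false;
    }

    public static int TimeoutMinutesFor(string userAgent)
    {
        return LooksLikeBot(userAgent) ? BotTimeoutMinutes : DefaultTimeoutMinutes;
    }
}

// In Global.asax.cs (requires System.Web) this could be applied as:
//   void Session_Start(object sender, EventArgs e)
//   {
//       Session.Timeout = BotSessionPolicy.TimeoutMinutesFor(Request.UserAgent);
//   }
```

Session.Timeout is expressed in minutes, so bot sessions would be discarded shortly after each request instead of piling up.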

Instead, I'd suggest keeping session tracking (using URL parameters as a fallback for cookies) and adding a canonical link tag (<link rel="canonical" href="..." />, obviously without the session ID itself) to each page. See "Make Google Ignore JSESSIONID" or an extensive video featuring Matt Cutts for discussion. Adding this tag isn't very intrusive and could be considered good practice anyway. So you would end up without any dedicated handling of search engine spiders, which certainly is a Good Thing (tm).
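To make the canonical tag idea concrete, here is a small C# sketch that removes a session parameter from a URL and builds the tag. The class name and the "sid" parameter in the usage line are hypothetical; substitute whatever query parameter your site uses for its session GUID. The sketch assumes the URL has no #fragment.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical helper; names are illustrative only.
public static class CanonicalLink
{
    // Returns the URL with every occurrence of the named query parameter removed.
    public static string StripParameter(string url, string parameterName)
    {
        int q = url.IndexOf('?');
        if (q < 0)
            return url;

        string path = url.Substring(0, q);
        string query = url.Substring(q + 1);

        List<string> kept = new List<string>();
        foreach (string pair in query.Split(new[] { '&' }, StringSplitOptions.RemoveEmptyEntries))
        {
            string name = pair.Split('=')[0];
            if (!string.Equals(name, parameterName, StringComparison.OrdinalIgnoreCase))
                kept.Add(pair);
        }

        return kept.Count == 0 ? path : path + "?" + string.Join("&", kept.ToArray());
    }

    // Builds the <link rel="canonical"> tag for the page's session-free URL.
    public static string Tag(string url, string parameterName)
    {
        string clean = StripParameter(url, parameterName);
        return "<link rel=\"canonical\" href=\"" + WebUtility.HtmlEncode(clean) + "\" />";
    }
}
```

For example, CanonicalLink.Tag("http://example.com/page.aspx?sid=abc123&id=5", "sid") yields a tag whose href is http://example.com/page.aspx?id=5, which would be emitted in the head of each page.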

answered Sep 27 '22 15:09

sfussenegger
