When a crawler reads the User-Agent line of a robots.txt file, does it attempt to match it exactly to its own User-Agent or does it attempt to match it as a substring of its User-Agent?
Nothing I have read answers this question explicitly. According to another Stack Overflow thread, it is an exact match.
However, the RFC draft leads me to believe it is a substring match. For example, "User-Agent: Google" would match both "Googlebot" and "Googlebot-News". Here is the relevant quotation from the RFC draft:
"The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring."
Additionally, the "Order of precedence for user-agents" section of Googlebot's documentation explains that the user agent for Google Images, "Googlebot-Image/1.0", will match "User-Agent: googlebot".
I would appreciate any clarity here, and the answer may be more complicated than my question. For example, Eugene Kalinin's robots module for Node mentions (at line 29) splitting the User-Agent to get its "name token" and matching against that. If that is the correct behavior, then Googlebot's full User-Agent string, "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)", will not match "User-Agent: Googlebot".
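To illustrate why the name-token approach worries me, here is a minimal sketch of that kind of splitting (the function name and the exact splitting rule are my own illustration, not the module's actual code):

```python
def name_token(user_agent):
    # Hypothetical name-token extraction: take the first
    # whitespace-delimited product token and strip any "/version" suffix.
    return user_agent.split()[0].split("/")[0]

full_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(name_token(full_ua))  # prints "Mozilla", not "Googlebot"
```

Under this interpretation, the extracted token is "Mozilla", so a "User-Agent: Googlebot" record would never match Googlebot's real request header.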
In the original robots.txt specification (from 1994), it says:
User-agent
[…]
The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.
[…]
Whether, and which, bots/parsers comply with this is another question and can't be answered in general.
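Read one way, the 1994 recommendation amounts to a case-insensitive substring check with version information stripped. A minimal sketch of that interpretation (the direction of the match is my assumption, chosen so that "Google" matches "Googlebot", as in the question's example):

```python
def spec_match(robots_value, robot_name):
    # Case-insensitive substring match of the name without version info,
    # per the 1994 spec's recommendation. Assumed direction: the robots.txt
    # value matches when it occurs as a substring of the robot's name token.
    name = robot_name.split("/")[0].lower()
    return robots_value.lower() in name

print(spec_match("Google", "Googlebot"))               # True
print(spec_match("googlebot", "Googlebot-Image/1.0"))  # True
print(spec_match("Googlebot-News", "Googlebot"))       # False
```

Note that a parser could just as defensibly match in the opposite direction, which is exactly why behavior varies between implementations.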
Every robot does this a little differently. There is really no single reliable way to map the user-agent in robots.txt to the user-agent sent in the request headers. The safest thing to do is to treat them as two separate, arbitrary strings. The only 100% reliable way to find the robots.txt user-agent is to read the official documentation for the given robot.
Edit:
Your best bet is generally to read the official documentation for the given robot, but even this is not 100% accurate. As Michael Marr points out, Google has a robots.txt testing tool that can be used to verify which UA will work with a given robot. This tool reveals that their documentation is inaccurate. Specifically, the page https://developers.google.com/webmasters/control-crawl-index/docs/ claims that their media partner bots respond to the 'Googlebot' UA, but the tool shows that they don't.