Should I use different case-spellings for case-insensitive directories in robots.txt?

Unfortunately, I’ve got case-insensitive servers that cannot be replaced in the short term. Some directories need to be excluded from crawling, so I have to Disallow them in my robots.txt. Let’s take /Img/ as an example. If I keep it all lower case…

User-agent: *
Disallow: /img/

… it does not match the actual physical path, and the Disallow directive is not applied to addresses containing /Img/ or /IMG/. Crawlers treat these variations as distinct paths.

Microsoft’s robots.txt is an amusing example of this. They probably use IIS servers, and SERPs are full of disallowed addresses, only in different letter cases.

What can I do?
Is it valid (and effective) to state the following?

User-agent: *
Disallow: /Img/
Disallow: /img/
Disallow: /IMG/
asked Feb 25 '14 11:02 by dakab





2 Answers

The original robots.txt specification doesn't say anything about letter case in URL paths, but according to Google's robots.txt specification, paths are case-sensitive. That means "Disallow: /img/" only blocks paths beginning with "/img/", not "/Img/" or "/IMG/". Your solution is valid, and it will block each spelling you list explicitly.
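The case-sensitive matching described above can be checked locally. The following is a minimal sketch using Python's standard-library urllib.robotparser, which does plain prefix matching on paths; it is not Googlebot's parser, so treat it only as an illustration. The helper name is_allowed and the agent name "ExampleBot" are made up for this example.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, path, agent="ExampleBot"):
    """Return True if `agent` may fetch `path` under the given robots.txt text.

    `agent` is a hypothetical crawler name; it falls back to the `*` group.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# Only the lower-case spelling is listed.
LOWER_ONLY = """User-agent: *
Disallow: /img/
"""

# The three spellings proposed in the question.
THREE_SPELLINGS = """User-agent: *
Disallow: /Img/
Disallow: /img/
Disallow: /IMG/
"""

if __name__ == "__main__":
    for path in ("/img/logo.png", "/Img/logo.png", "/IMG/logo.png", "/iMg/logo.png"):
        print(path, is_allowed(LOWER_ONLY, path), is_allowed(THREE_SPELLINGS, path))
```

With this parser, the lower-case rule alone leaves /Img/ and /IMG/ crawlable, and the three-line version still leaves any spelling that is not listed (such as /iMg/) crawlable.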

That being said, I would only resort to this solution if I had reason to believe the alternate-case URLs were actually being crawled, and they were causing a problem. You can easily turn your robots.txt file into an unmaintainable mess otherwise.

answered Oct 20 '22 19:10 by plasticinsect



As the Disallow field takes (beginnings of) URL paths as its value, and URL paths are case-sensitive, your assumption is correct.

So yes, if you want to block all URLs whose paths start with any case variant of /img/, you’d need to list every variant:

Disallow: /img/
Disallow: /IMG/
Disallow: /Img/
Disallow: /IMg/
Disallow: /ImG/
Disallow: /iMg/
Disallow: /iMG/
Disallow: /imG/
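Writing these lines by hand gets tedious for longer directory names, so they could be generated instead. This is a rough sketch, assuming ASCII-only paths; the function names case_variants and disallow_lines are invented for this example.

```python
from itertools import product


def case_variants(path):
    """Return every upper/lower-case spelling of `path` (assumes ASCII)."""
    # Letters contribute two options; characters like "/" contribute one.
    options = [sorted({ch.lower(), ch.upper()}) for ch in path]
    return ["".join(combo) for combo in product(*options)]


def disallow_lines(path):
    """Build one Disallow line per case variant of `path`."""
    return [f"Disallow: {variant}" for variant in case_variants(path)]


if __name__ == "__main__":
    print("User-agent: *")
    print("\n".join(disallow_lines("/img/")))
```

For /img/ this yields the eight lines above, in a different order. The count doubles with every additional letter (2^n lines for n letters), so this only stays manageable for short names.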
answered Oct 20 '22 19:10 by unor
