
How to disallow all dynamic URLs in robots.txt [closed]

Tags:

robots.txt

How do I disallow all dynamic URLs in robots.txt? At the moment my file has entries like:

Disallow: /?q=admin/
Disallow: /?q=aggregator/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/

I want to disallow everything that starts with /?q=.

pmarreddy asked Sep 29 '09 at 22:09



2 Answers

The answer to your question is to use

Disallow: /?q=

The best (currently accessible) source on robots.txt that I could find is the Wikipedia article. (The supposedly definitive source is http://www.robotstxt.org, but the site is down at the moment.)

According to the Wikipedia page, the standard defines just two fields: User-agent: and Disallow:. The Disallow: field does not allow explicit wildcards; each "disallowed" path is actually a path prefix, i.e. it matches any path that starts with the specified value.

The Allow: field is a non-standard extension, and any support for explicit wildcards in Disallow would be a non-standard extension. If you use these, you have no right to expect that a (legitimate) web crawler will understand them.

This is not a matter of crawlers being "smart" or "dumb": it is all about standards compliance and interoperability. For example, any web crawler that did "smart" things with explicit wildcard characters in a "Disallow:" would be bad for (hypothetical) robots.txt files where those characters were intended to be interpreted literally.
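
If you want to sanity-check that prefix behaviour before relying on it, here is a minimal sketch using Python's standard urllib.robotparser (the example.com URLs and the "AnyBot" agent name are placeholders, not anything from your site):

from urllib.robotparser import RobotFileParser

# The suggested rule, placed under a wildcard user-agent group.
rules = """\
User-agent: *
Disallow: /?q=
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Paths beginning with /?q= match the prefix and are disallowed.
print(parser.can_fetch("AnyBot", "https://example.com/?q=admin/"))       # False
print(parser.can_fetch("AnyBot", "https://example.com/?q=user/login/"))  # False
# Anything that does not start with /?q= remains crawlable.
print(parser.can_fetch("AnyBot", "https://example.com/about"))           # True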

Stephen C answered Sep 20 '22 at 21:09


As the other answer notes, a lot of robots.txt interpreters are not too bright and might not interpret wildcards in the path the way you intend.

That said, some crawlers try to skip dynamic pages on their own, worrying they might get caught in infinite loops on links with varying URLs. I am assuming you are asking this question because you are facing a particularly determined crawler that is trying hard to access those dynamic paths.

If you have issues with a specific crawler, you can investigate how that crawler works by looking up its robots.txt capabilities and adding a crawler-specific section to your robots.txt for it.
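
For example, a stricter group for one crawler can sit alongside the general group; "GreedyBot" below is a made-up name purely for illustration, and the behaviour is checked with Python's standard urllib.robotparser:

from urllib.robotparser import RobotFileParser

# "GreedyBot" is a hypothetical misbehaving crawler; it gets its own
# stricter group, while every other crawler falls through to User-agent: *.
rules = """\
User-agent: GreedyBot
Disallow: /

User-agent: *
Disallow: /?q=
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("GreedyBot", "https://example.com/about"))      # False: blocked everywhere
print(parser.can_fetch("OtherBot", "https://example.com/about"))       # True: only /?q= is off limits
print(parser.can_fetch("OtherBot", "https://example.com/?q=search/"))  # False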

If you generally just want to disallow such access to your dynamic pages, you might want to rethink your robots.txt design.

More often than not, the "pages" that handle dynamic parameters live under one directory or a small set of directories. That is why it is normally enough to simply Disallow: /cgi-bin or /app and be done with it.
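
As a quick illustration of that layout (again only a sketch; the directory names come from the paragraph above and the URLs are hypothetical):

from urllib.robotparser import RobotFileParser

# Everything dynamic is assumed to live under /cgi-bin or /app.
rules = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /app
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("AnyBot", "https://example.com/cgi-bin/search?q=x"))  # False
print(parser.can_fetch("AnyBot", "https://example.com/app/login"))           # False
print(parser.can_fetch("AnyBot", "https://example.com/products.html"))       # True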

In your case you seem to have mapped the root to an area that handles parameters. You might want to reverse the logic of robots.txt and say something like:

User-agent: * 
Allow: /index.html
Allow: /offices
Allow: /static 
Disallow: /

This way your Allow list overrides your Disallow list by spelling out exactly what crawlers should index. Note that not all crawlers are created equal, and you may want to refine that robots.txt later, adding a specific section for any crawler that still misbehaves.
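
As a rough check of that whitelist-style file (again a sketch with Python's standard urllib.robotparser; note that this parser applies the first matching rule in file order, whereas some crawlers, Googlebot for instance, pick the longest matching rule, so results can differ slightly between implementations):

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Allow: /index.html
Allow: /offices
Allow: /static
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# The explicit Allow prefixes match before the blanket Disallow: / does.
print(parser.can_fetch("AnyBot", "https://example.com/index.html"))       # True
print(parser.can_fetch("AnyBot", "https://example.com/static/site.css"))  # True
# Everything else, including the dynamic /?q= URLs, falls through to Disallow: /.
print(parser.can_fetch("AnyBot", "https://example.com/?q=admin/"))        # False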

Ben Dadsetan answered Sep 18 '22 at 21:09