
How to disallow all dynamic URLs in robots.txt [closed]

Tags:

robots.txt

How do I disallow all dynamic URLs in robots.txt? At the moment my file has entries like:

Disallow: /?q=admin/
Disallow: /?q=aggregator/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/

I want to disallow everything that starts with /?q=.

pmarreddy asked Sep 29 '09 at 22:09



2 Answers

The answer to your question is to use

Disallow: /?q=

The best (currently accessible) source on robots.txt that I could find is the Wikipedia article. (The supposedly definitive source is http://www.robotstxt.org, but the site is down at the moment.)

According to the Wikipedia page, the standard defines just two fields: User-agent: and Disallow:. The Disallow: field does not allow explicit wildcards; each "disallowed" path is actually a path prefix, i.e. it matches any path that starts with the specified value.

The Allow: field is a non-standard extension, and any support for explicit wildcards in Disallow would be a non-standard extension. If you use these, you have no right to expect that a (legitimate) web crawler will understand them.

This is not a matter of crawlers being "smart" or "dumb": it is all about standards compliance and interoperability. For example, any web crawler that did "smart" things with explicit wildcard characters in a "Disallow:" would be bad for (hypothetical) robots.txt files where those characters were intended to be interpreted literally.
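
If you want to sanity-check that prefix behaviour before relying on it, here is a minimal sketch using Python's standard urllib.robotparser (the example.com URLs and the "AnyBot" agent name are placeholders, not anything from your site):

from urllib.robotparser import RobotFileParser

# The suggested rule, placed under a wildcard user-agent group.
rules = """\
User-agent: *
Disallow: /?q=
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Paths beginning with /?q= match the prefix and are disallowed.
print(parser.can_fetch("AnyBot", "https://example.com/?q=admin/"))       # False
print(parser.can_fetch("AnyBot", "https://example.com/?q=user/login/"))  # False
# Anything that does not start with /?q= remains crawlable.
print(parser.can_fetch("AnyBot", "https://example.com/about"))           # True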

Stephen C answered Sep 20 '22 at 21:09


As the other answer notes, a lot of robots.txt interpreters are not too bright and might not interpret wildcards in the path the way you intend.

That said, some crawlers try to skip dynamic pages on their own, worrying they might get caught in infinite loops on links with varying URLs. I am assuming you are asking this question because you are facing a particularly determined crawler that is trying hard to access those dynamic paths.

If you have issues with a specific crawler, you can investigate how that crawler works by looking up its robots.txt capabilities and adding a crawler-specific section to your robots.txt for it.
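
For example, a stricter group for one crawler can sit alongside the general group; "GreedyBot" below is a made-up name purely for illustration, and the behaviour is checked with Python's standard urllib.robotparser:

from urllib.robotparser import RobotFileParser

# "GreedyBot" is a hypothetical misbehaving crawler; it gets its own
# stricter group, while every other crawler falls through to User-agent: *.
rules = """\
User-agent: GreedyBot
Disallow: /

User-agent: *
Disallow: /?q=
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("GreedyBot", "https://example.com/about"))      # False: blocked everywhere
print(parser.can_fetch("OtherBot", "https://example.com/about"))       # True: only /?q= is off limits
print(parser.can_fetch("OtherBot", "https://example.com/?q=search/"))  # False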

If you generally just want to disallow such access to your dynamic pages, you might want to rethink your robots.txt design.

More often than not, the "pages" that handle dynamic parameters live under one directory or a small set of directories. That is why it is normally enough to simply Disallow: /cgi-bin or /app and be done with it.
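
As a quick illustration of that layout (again only a sketch; the directory names come from the paragraph above and the URLs are hypothetical):

from urllib.robotparser import RobotFileParser

# Everything dynamic is assumed to live under /cgi-bin or /app.
rules = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /app
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("AnyBot", "https://example.com/cgi-bin/search?q=x"))  # False
print(parser.can_fetch("AnyBot", "https://example.com/app/login"))           # False
print(parser.can_fetch("AnyBot", "https://example.com/products.html"))       # True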

In your case you seem to have mapped the root to an area that handles parameters. You might want to reverse the logic of robots.txt and say something like:

User-agent: * 
Allow: /index.html
Allow: /offices
Allow: /static 
Disallow: /

This way your Allow list overrides your Disallow list by spelling out exactly what crawlers should index. Note that not all crawlers are created equal, and you may want to refine that robots.txt later, adding a specific section for any crawler that still misbehaves.
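
As a rough check of that whitelist-style file (again a sketch with Python's standard urllib.robotparser; note that this parser applies the first matching rule in file order, whereas some crawlers, Googlebot for instance, pick the longest matching rule, so results can differ slightly between implementations):

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Allow: /index.html
Allow: /offices
Allow: /static
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# The explicit Allow prefixes match before the blanket Disallow: / does.
print(parser.can_fetch("AnyBot", "https://example.com/index.html"))       # True
print(parser.can_fetch("AnyBot", "https://example.com/static/site.css"))  # True
# Everything else, including the dynamic /?q= URLs, falls through to Disallow: /.
print(parser.can_fetch("AnyBot", "https://example.com/?q=admin/"))        # False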

Ben Dadsetan answered Sep 18 '22 at 21:09