 

Block Google robots for URLs containing a certain word

Tags:

robots.txt

My client has a load of pages which they don't want indexed by Google - they are all named

http://example.com/page-xxx

so they are /page-123 or /page-2 or /page-25, etc.

Is there a way to stop Google indexing any page that starts with /page-xxx using robots.txt?

Would something like this work?

Disallow: /page-*

Thanks

asked Jul 28 '11 by JorgeLuisBorges

People also ask

How do I block Google bots?

To prevent specific articles on your site from appearing in Google News and Google Search, block access to Googlebot using the following meta tag: <meta name="googlebot" content="noindex, nofollow">.

How do you prevent search crawlers from crawling a URL?

You can prevent a page or other resource from appearing in Google Search by including a noindex meta tag or header in the HTTP response. When Googlebot next crawls that page and sees the tag or header, Google will drop that page entirely from Google Search results, regardless of whether other sites link to it.
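
For example, here is a minimal sketch (my own illustration, not from Google's documentation; the Flask framework and the route name are just placeholders) of sending the noindex signal as an X-Robots-Tag response header instead of a meta tag:

from flask import Flask, make_response

app = Flask(__name__)

# Hypothetical route for the pages you don't want indexed.
@app.route("/page-<page_id>")
def private_page(page_id):
    response = make_response(f"Private page {page_id}")
    # Crawlers that honour X-Robots-Tag (Googlebot does) will drop the page
    # from their index and won't follow its links.
    response.headers["X-Robots-Tag"] = "noindex, nofollow"
    return response

if __name__ == "__main__":
    app.run()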

How do you stop robots from looking at things on a website?

To prevent specific articles on your site from being indexed by all robots, use the following meta tag: <meta name="robots" content="noindex, nofollow">. To prevent robots from crawling images on a specific article, use the following meta tag: <meta name="robots" content="noimageindex">.

How do I remove robots.txt from a website?

The robots.txt file is located in the root directory of your web hosting folder, which can normally be found in /public_html/. You should be able to edit or delete this file over FTP, using a client such as FileZilla or WinSCP.
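
If you want to confirm what is actually being served after editing or deleting the file, a quick sketch like this (Python; the domain is a placeholder) fetches the live robots.txt from the root of the domain:

import urllib.error
import urllib.request

url = "http://example.com/robots.txt"  # placeholder domain
try:
    with urllib.request.urlopen(url) as resp:
        # Print whatever robots.txt the server is currently returning.
        print(resp.read().decode("utf-8", errors="replace"))
except urllib.error.HTTPError as err:
    # A 404 here means the file has been removed (or never existed).
    print(f"robots.txt not found: HTTP {err.code}")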


2 Answers

In the first place, a line that says Disallow: /post-* isn't going to do anything to prevent crawling of pages of the form "/page-xxx" - so double-check that your Disallow line says "page", not "post".

Disallow says, in essence, "disallow URLs that start with this text". So a Disallow: /post-* line will disallow any URL that starts with "/post-" (that is, a file in the root directory whose name starts with "post-"). The asterisk in this case is superfluous, as it's implied.

Your question is unclear as to where the pages are. If they're all in the root directory, then a simple Disallow: /page- will work. If they're scattered across directories in many different places, then things are a bit more difficult.

As @user728345 pointed out, the easiest way (from a robots.txt standpoint) to handle this is to gather all of the pages you don't want crawled into one directory, and disallow access to that. But I understand if you can't move all those pages.

For Googlebot specifically, and other bots that support the same wildcard semantics (there are a surprising number of them, including mine), the following should work:

Disallow: /*page-

That will match anything that contains "page-" anywhere. However, that will also block something like "/test/thispage-123.html". If you want to prevent that, then I think (I'm not sure, as I haven't tried it) that this will work:

Disallow: */page-
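
If you want to sanity-check which URLs a wildcard rule would catch before relying on it, here's a rough sketch (my own illustration, not Googlebot's actual matcher) that emulates the wildcard matching by translating a pattern into a regular expression:

import re

def googlebot_style_match(pattern, path):
    # Rough emulation of Googlebot wildcard matching: '*' matches any run of
    # characters, a trailing '$' anchors the end, and the pattern is matched
    # from the start of the URL path.
    regex = "^" + re.escape(pattern).replace(r"\*", ".*").replace(r"\$", "$")
    return re.match(regex, path) is not None

for pattern in ("/*page-", "*/page-"):
    for path in ("/page-123", "/test/thispage-123.html"):
        print(pattern, path, googlebot_style_match(pattern, path))
# /*page- matches both paths, while */page- matches only /page-123,
# because it requires "page-" to come right after a slash.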

answered Sep 28 '22 by Jim Mischel


It looks like the * will work as a Google wildcard, so your rule will keep Google from crawling; however, wildcards are not supported by all other spiders. You can search Google for "robots.txt wildcards", or see http://seogadget.co.uk/wildcards-in-robots-txt/, for more information.

Then I pulled this from Google's documentation:

Pattern matching

Googlebot (but not all search engines) respects some pattern matching.

To match a sequence of characters, use an asterisk (*). For instance, to block access to all subdirectories that begin with private:

User-agent: Googlebot
Disallow: /private*/

To block access to all URLs that include a question mark (?) (more specifically, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string):

User-agent: Googlebot
Disallow: /*?

To specify matching the end of a URL, use $. For instance, to block any URLs that end with .xls:

User-agent: Googlebot
Disallow: /*.xls$

You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn't crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:

User-agent: *
Allow: /*?$
Disallow: /*?

The Disallow: /*? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).

The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
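
As a rough, self-contained illustration (my own sketch, not Google's matcher), you can translate those two patterns into regular expressions and see which URLs each rule matches; Google then applies the most specific (longest) matching rule, with Allow winning ties:

import re

def rule_matches(pattern, path):
    # '*' -> '.*', trailing '$' -> end-of-string anchor, matched from the
    # start of the URL path.
    regex = "^" + re.escape(pattern).replace(r"\*", ".*").replace(r"\$", "$")
    return re.match(regex, path) is not None

for path in ("/page?", "/page?sessionid=123"):
    print(path,
          "Allow /*?$ matches:", rule_matches("/*?$", path),
          "Disallow /*? matches:", rule_matches("/*?", path))
# "/page?" matches both rules; the longer Allow: /*?$ rule wins, so it stays crawlable.
# "/page?sessionid=123" matches only Disallow: /*?, so it is blocked.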

Save your robots.txt file by downloading the file or copying the contents to a text file and saving as robots.txt. Save the file to the highest-level directory of your site. The robots.txt file must reside in the root of the domain and must be named "robots.txt". A robots.txt file located in a subdirectory isn't valid, as bots only check for this file in the root of the domain. For instance, http://www.example.com/robots.txt is a valid location, but http://www.example.com/mysite/robots.txt is not.

Note: From what I read, this is a Google-only approach. Officially, no wildcards are allowed in robots.txt for Disallow.

answered Sep 28 '22 by Travis Pessetto