
robots.txt to disallow all pages except one? Do they override and cascade?

Tags:

robots.txt

I want one page of my site to be crawled and no others.

Also, if it's any different from the above, I'd also like to know the syntax for disallowing everything but the root (index) of the website.

# robots.txt for http://example.com/

User-agent: *
Disallow: /style-guide
Disallow: /splash
Disallow: /etc
Disallow: /etc
Disallow: /etc
Disallow: /etc
Disallow: /etc

Or can I do it like this?

# robots.txt for http://example.com/

User-agent: *
Disallow: /
Allow: /under-construction

I should also mention that this is a WordPress install, and "under-construction," for example, is set as the front page, so in that case it acts as the index.

I think what I need is for http://example.com to be crawled, but no other pages.

# robots.txt for http://example.com/

User-agent: *
Disallow: /*

Would this mean disallow anything after the root?

asked Nov 08 '13 by nouveau



3 Answers

The easiest way to allow access to just one page would be:

User-agent: *
Allow: /under-construction
Disallow: /

The original robots.txt specification says that crawlers should read robots.txt from top to bottom, and use the first matching rule. If you put the Disallow first, then many bots will see it as saying they can't crawl anything. By putting the Allow first, those that apply the rules from top to bottom will see that they can access that page.
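To sanity-check that ordering locally, here is a rough sketch using Python's standard-library urllib.robotparser (my own illustration, not part of the answer): it also evaluates rules in the order they appear and does not understand wildcards, so it models the original-spec behavior described above.

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Allow: /under-construction
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())
rp.modified()  # mark the rules as loaded; otherwise can_fetch() assumes the file was never read

print(rp.can_fetch("ExampleBot", "http://example.com/under-construction"))  # True
print(rp.can_fetch("ExampleBot", "http://example.com/some-other-page"))     # False
print(rp.can_fetch("ExampleBot", "http://example.com/"))                    # False - the root itself is still blocked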

The expression rules are simple: the expression Disallow: / says "disallow anything that starts with a slash." So that means everything on the site.

Your Disallow: /* means the same thing to Googlebot and Bingbot, but bots that don't support wildcards could see the /* and think that you meant a literal *. So they could assume that it was okay to crawl /*foo/bar.html.

If you just want to crawl http://example.com, but nothing else, you might try:

Allow: /$
Disallow: /

The $ means "end of string," just like in regular expressions. Again, that'll work for Google and Bing, but won't work for other crawlers if they don't support wildcards.
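For bots that do support wildcards, the matching is roughly "treat * as any sequence of characters and a trailing $ as the end of the URL path." Here is a small sketch of that idea in Python (my own illustration of the general technique, not Google's or Bing's actual implementation):

import re

def robots_pattern_to_regex(pattern):
    # Escape everything, then re-enable the two robots.txt wildcards:
    # '*' matches any sequence of characters, a trailing '$' anchors the end of the path.
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.compile(regex)

allow_root_only = robots_pattern_to_regex("/$")  # from "Allow: /$"
disallow_all = robots_pattern_to_regex("/")      # from "Disallow: /"

print(bool(allow_root_only.match("/")))       # True  - only the homepage matches
print(bool(allow_root_only.match("/about")))  # False - falls through to Disallow: /
print(bool(disallow_all.match("/about")))     # True  - everything starting with / is disallowed

A bot with no wildcard support would instead treat /$ as a literal prefix, which is why the answer hedges about crawlers other than Google and Bing.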

answered Sep 28 '22 by Jim Mischel

If you log into Google Webmaster Tools, go to Crawl in the left panel, then to Fetch as Google. There you can test how Google will crawl each page.

In the case of blocking everything but the homepage:

User-agent: *
Allow: /$
Disallow: /

will work.

answered Sep 28 '22 by Kohjah Breese


You can use either of the two blocks below; both will work:

User-agent: *
Allow: /$
Disallow: /

or

User-agent: *
Allow: /index.php
Disallow: /

The Allow must come before the Disallow because the file is read from top to bottom.

Disallow: / says "disallow anything that starts with a slash." So that means everything on the site.

The $ means "end of string," like in regular expressions, so the result of Allow: /$ is that only your homepage (the index) is allowed.
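If you want to check the /index.php variant locally, here is a rough sketch with Python's standard-library urllib.robotparser (my assumption, not part of this answer; it handles plain Allow/Disallow prefixes but not the $ wildcard, so it cannot verify the /$ variant):

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Allow: /index.php
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())
rp.modified()  # mark the rules as loaded so can_fetch() evaluates them

print(rp.can_fetch("ExampleBot", "http://example.com/index.php"))  # True
print(rp.can_fetch("ExampleBot", "http://example.com/about"))      # False
print(rp.can_fetch("ExampleBot", "http://example.com/"))           # False - the bare / is not an /index.php prefix

Note that with the /index.php variant only that explicit URL is allowed; the bare / is still caught by Disallow: /, which is why the /$ form is usually the better fit for Google and Bing.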

answered Sep 28 '22 by Aominé