Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Anybody got any C# code to parse robots.txt and evaluate URLS against it

Tags:

c#

robots.txt

Short question:

Has anybody got any C# code to parse robots.txt and then evaluate URLS against it so see if they would be excluded or not.

Long question:

I have been creating a sitemap for a new site yet to be released to google. The sitemap has two modes, a user mode (like a traditional sitemap) and an 'admin' mode.

The admin mode will show all possible URLS on the site, including customized entry URLS or URLS for a specific outside partner - such as example.com/oprah for anyone who sees our site on Oprah. I want to track published links somewhere other than in an Excel spreadsheet.

I would have to assume that someone might publish the /oprah link on their blog or somewhere. We don't actually want this 'mini-oprah site' to be indexed because it would result in non-oprah viewers being able to find the special Oprah offers.

So at the same time I was creating the sitemap I also added URLS such as /oprah to be excluded from our robots.txt file.

Then (and this is the actual question) I thought 'wouldn't it be nice to be able to show on the sitemap whether or not files are indexed and visible to robots'. This would be quite simple - just parse robots.txt and then evaluate a link against it.

However this is a 'bonus feature' and I certainly don't have time to go off and write it (even thought its probably not that complex) - so I was wondering if anyone has already written any code to parse robots.txt ?

like image 996
Simon_Weaver Avatar asked Mar 11 '09 05:03

Simon_Weaver


3 Answers

Hate to say that, but just google "C# robots.txt parser" and click the first hit. It's a CodeProject article about a simple search engine implemented in C# called "Searcharoo", and it contains a class Searcharoo.Indexer.RobotsTxt, described as:

  1. Check for, and if present, download and parse the robots.txt file on the site
  2. Provide an interface for the Spider to check each Url against the robots.txt rules
like image 77
realMarkusSchmidt Avatar answered Oct 25 '22 08:10

realMarkusSchmidt


I like the code and tests in http://code.google.com/p/robotstxt/ would recommend it as a starting point.

like image 31
Sam Saffron Avatar answered Oct 25 '22 09:10

Sam Saffron


A bit of self promoting, but since I needed a similar parser and couldn't find anything I was happy with, I created my own:

http://nrobots.codeplex.com/

I'd love any feedback

like image 28
SaguiItay Avatar answered Oct 25 '22 09:10

SaguiItay