Short question:
Has anybody got any C# code to parse robots.txt and then evaluate URLs against it, to see whether they would be excluded or not?
Long question:
I have been creating a sitemap for a new site yet to be released to Google. The sitemap has two modes: a user mode (like a traditional sitemap) and an 'admin' mode.
The admin mode will show all possible URLs on the site, including customized entry URLs or URLs for a specific outside partner, such as example.com/oprah for anyone who sees our site on Oprah. I want to track published links somewhere other than in an Excel spreadsheet.
I would have to assume that someone might publish the /oprah link on their blog or somewhere. We don't actually want this 'mini Oprah site' to be indexed, because that would let non-Oprah viewers find the special Oprah offers.
So at the same time I was creating the sitemap, I also added URLs such as /oprah to the exclusions in our robots.txt file.
Then (and this is the actual question) I thought: wouldn't it be nice to show on the sitemap whether or not files are indexed and visible to robots? This would be quite simple: just parse robots.txt and then evaluate a link against it.
However, this is a 'bonus feature' and I certainly don't have time to go off and write it (even though it's probably not that complex), so I was wondering if anyone has already written any code to parse robots.txt?
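For context, here is a minimal sketch of the kind of parser being asked about. It is not from any of the libraries mentioned below; it only handles `User-agent` sections and `Disallow` prefix rules, and ignores `Allow`, wildcards, `Crawl-delay`, and the other extensions a real parser would need. The class and method names are illustrative, not an existing API.

```csharp
using System;
using System.Collections.Generic;

// Minimal robots.txt checker (sketch): collects the Disallow prefixes that
// apply to a given user-agent (or to "*") and tests URL paths against them.
public class RobotsTxtRules
{
    private readonly List<string> _disallowed = new List<string>();

    public RobotsTxtRules(string robotsTxtContent, string userAgent)
    {
        bool sectionApplies = false;
        foreach (string rawLine in robotsTxtContent.Split('\n'))
        {
            // Strip comments and surrounding whitespace.
            string line = rawLine;
            int hash = line.IndexOf('#');
            if (hash >= 0) line = line.Substring(0, hash);
            line = line.Trim();
            if (line.Length == 0) continue;

            int colon = line.IndexOf(':');
            if (colon < 0) continue;
            string field = line.Substring(0, colon).Trim().ToLowerInvariant();
            string value = line.Substring(colon + 1).Trim();

            if (field == "user-agent")
            {
                // Track whether the current section applies to our bot.
                sectionApplies = value == "*" ||
                    userAgent.IndexOf(value, StringComparison.OrdinalIgnoreCase) >= 0;
            }
            else if (field == "disallow" && sectionApplies && value.Length > 0)
            {
                _disallowed.Add(value);
            }
        }
    }

    // True if the path (e.g. "/oprah") matches any applicable Disallow prefix.
    public bool IsDisallowed(string path)
    {
        foreach (string prefix in _disallowed)
            if (path.StartsWith(prefix, StringComparison.Ordinal))
                return true;
        return false;
    }
}
```

With a robots.txt containing `User-agent: *` and `Disallow: /oprah`, `IsDisallowed("/oprah")` would report the link as blocked, which is exactly the flag the sitemap's admin mode would display next to each URL.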
Hate to say it, but just Google "C# robots.txt parser" and click the first hit. It's a CodeProject article about a simple search engine implemented in C# called "Searcharoo", and it contains a class Searcharoo.Indexer.RobotsTxt, described as:
- Check for, and if present, download and parse the robots.txt file on the site
- Provide an interface for the Spider to check each Url against the robots.txt rules
I like the code and tests in http://code.google.com/p/robotstxt/ and would recommend it as a starting point.
A bit of self promoting, but since I needed a similar parser and couldn't find anything I was happy with, I created my own:
http://nrobots.codeplex.com/
I'd love any feedback.