Java robots.txt parser with wildcard support

I'm looking for a robots.txt parser in Java that supports the same pattern-matching rules as Googlebot.

I've found some libraries that parse robots.txt files, but none of them supports Googlebot-style pattern matching:

  • Heritrix (there is an open issue on this subject)
  • Crawler4j (looks like the same implementation as Heritrix)
  • jrobotx

Does anyone know of a Java library that can do this?

Asked Aug 30 '11 by clement


People also ask

What is a wildcard in robots.txt?

While typical robots.txt rules prevent crawling of the pages in a directory or of a specific URL, wildcards in your robots.txt file let you block search engines from accessing content based on patterns in URLs, such as a parameter or the repetition of a character.
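For instance, a couple of Googlebot-style wildcard rules could look like this (the paths and the sessionid parameter name are made up for illustration):

    User-agent: Googlebot
    Disallow: /*?sessionid=
    Disallow: /*.pdf$

Here * matches any sequence of characters and $ anchors the pattern to the end of the URL, so the first rule blocks any URL containing a sessionid parameter and the second blocks URLs ending in .pdf.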

What does User-agent: * mean in robots.txt?

A robots.txt file consists of one or more blocks of directives, each starting with a user-agent line. The "user-agent" is the name of the specific spider the block addresses. You can either have one block for all search engines, using a wildcard for the user-agent, or separate blocks for particular search engines.
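For illustration (the paths are made up), a file with one block for every crawler and one block for a specific spider might look like:

    # applies to all crawlers
    User-agent: *
    Disallow: /private/

    # applies only to Googlebot
    User-agent: Googlebot
    Disallow: /no-google/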

Does Google respect robots.txt?

Google officially announced that Googlebot will no longer obey a robots.txt directive related to indexing. Publishers relying on the robots.txt noindex directive had until September 1, 2019 to remove it and begin using an alternative.

What should be disallowed in robots.txt?

The Disallow directive in robots.txt tells search engines not to access certain files, pages, or sections of your website.


1 Answer

Nutch seems to use crawler-commons in combination with some custom code (see RobotsRulesParser.java). I'm not sure of the current state of affairs, though.
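For what it's worth, here is a minimal sketch of checking URLs against a robots.txt with crawler-commons' SimpleRobotRulesParser. The robots.txt content, URLs, and agent name are invented for the example, and the exact method signature may differ between crawler-commons versions:

    import java.nio.charset.StandardCharsets;

    import crawlercommons.robots.BaseRobotRules;
    import crawlercommons.robots.SimpleRobotRulesParser;

    public class RobotsCheck {
        public static void main(String[] args) {
            // Made-up robots.txt content using a Googlebot-style wildcard rule
            String robotsTxt =
                    "User-agent: *\n"
                    + "Disallow: /*?sessionid=\n"
                    + "Disallow: /private/\n";

            SimpleRobotRulesParser parser = new SimpleRobotRulesParser();

            // parseContent(robotsTxtUrl, content, contentType, robotNames)
            BaseRobotRules rules = parser.parseContent(
                    "http://example.com/robots.txt",
                    robotsTxt.getBytes(StandardCharsets.UTF_8),
                    "text/plain",
                    "mycrawler");

            // Should be blocked by the wildcard rule if the parser applies it
            System.out.println(rules.isAllowed("http://example.com/page?sessionid=123"));
            // Should be allowed
            System.out.println(rules.isAllowed("http://example.com/page"));
        }
    }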

In particular, issue NUTCH-1455 looks quite relevant to your needs:

If the user-agent name(s) configured in http.robots.agents contain spaces, they are not matched even if the name is exactly contained in the robots.txt: http.robots.agents = "Download Ninja,*"

Perhaps it's worth trying/patching/submitting the fix :)

Answered Oct 04 '22 by aldrinleal