I'm looking for a robots.txt parser in Java that supports the same pattern-matching rules as Googlebot.
I've found some libraries that parse robots.txt files, but none of them support Googlebot-style pattern matching.
Does anyone know of a Java library that can do this?
While typical rules in robots.txt block crawling of the pages in a directory or of a specific URL, wildcards in a robots.txt file let you block search engines from content based on patterns in URLs, such as a parameter or a repeated character.
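As a rough illustration of what "Googlebot-style" matching involves, here is a minimal sketch that assumes the two special characters Google documents for path rules: `*` matches any sequence of characters and a trailing `$` anchors the end of the URL. Everything else is a case-sensitive prefix match. The class and method names (`RobotsPathMatcher`, `compile`, `matches`) are made up for this example and do not come from any existing library.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of crawler-commons or Nutch.
final class RobotsPathMatcher {
    private RobotsPathMatcher() {}

    /** Translates a robots.txt path rule into an equivalent regular expression. */
    static Pattern compile(String rule) {
        // A trailing '$' means the rule must match up to the end of the URL path.
        boolean anchored = rule.endsWith("$");
        String body = anchored ? rule.substring(0, rule.length() - 1) : rule;

        StringBuilder regex = new StringBuilder();
        // Split on '*' (keeping trailing empty pieces) and quote the literal parts.
        String[] parts = body.split("\\*", -1);
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*"); // each '*' matches any sequence of characters
            }
            if (!parts[i].isEmpty()) {
                regex.append(Pattern.quote(parts[i]));
            }
        }
        if (!anchored) {
            regex.append(".*"); // unanchored rules are prefix matches
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** Returns true if the URL path (including any query string) matches the rule. */
    static boolean matches(String rule, String path) {
        // Note: an empty rule matches every path here; callers should treat an
        // empty Disallow value as "nothing disallowed" before calling this.
        return compile(rule).matcher(path).matches();
    }
}
```

With this sketch, `RobotsPathMatcher.matches("/*.php$", "/index.php")` is true, while the same rule does not match `/index.php?x=1`. Choosing between competing Allow and Disallow rules is a separate step that is not shown here.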
A robots.txt file consists of one or more blocks of directives, each starting with a user-agent line. The "user-agent" is the name of the specific spider the block addresses. You can either have one block for all search engines, using a wildcard for the user-agent, or separate blocks for particular search engines.
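The block structure described above can be sketched as a small grouping pass. This is only an illustration under simplifying assumptions: it handles `User-agent`, `Allow` and `Disallow` lines, strips `#` comments, and ignores every other field. The class name `RobotsGroups` and the `"field:value"` rule encoding are invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper that groups rules by the user-agent block they belong to.
final class RobotsGroups {
    private RobotsGroups() {}

    /** Maps each lower-cased user-agent name to its rules, encoded as "field:value". */
    static Map<String, List<String>> parse(String content) {
        Map<String, List<String>> rulesByAgent = new LinkedHashMap<>();
        List<String> currentAgents = new ArrayList<>();
        boolean lastWasAgent = false;

        for (String raw : content.split("\\R")) {
            // Drop comments, then surrounding whitespace.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();

            if (field.equals("user-agent")) {
                // Consecutive user-agent lines share one block; otherwise start a new one.
                if (!lastWasAgent) {
                    currentAgents = new ArrayList<>();
                }
                String agent = value.toLowerCase(Locale.ROOT);
                currentAgents.add(agent);
                rulesByAgent.computeIfAbsent(agent, k -> new ArrayList<>());
                lastWasAgent = true;
            } else if (field.equals("allow") || field.equals("disallow")) {
                for (String agent : currentAgents) {
                    rulesByAgent.get(agent).add(field + ":" + value);
                }
                lastWasAgent = false;
            }
        }
        return rulesByAgent;
    }
}
```

A crawler would then look up its own name in the returned map and fall back to the `*` entry if no specific block exists.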
Google officially announced that Googlebot will no longer obey a robots.txt directive related to indexing. Publishers relying on the robots.txt noindex directive have until September 1, 2019 to remove it and begin using an alternative.
Disallow directive in robots.txt: you can tell search engines not to access certain files, pages or sections of your website. This is done using the Disallow directive.
Nutch seems to be using a combination of crawler-commons and some custom code (see RobotsRulesParser.java). I'm not sure of the current state of affairs, though.
In particular, the issue NUTCH-1455 looks to be quite related to your needs:
If the user-agent name(s) configured in http.robots.agents contain spaces, they are not matched even if they are exactly contained in the robots.txt. Example: http.robots.agents = "Download Ninja,*"
Perhaps it's worth trying the fix, patching it, or submitting it upstream :)
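To make the problem in that issue concrete, here is a small sketch of agent-name matching that tolerates spaces. It is not Nutch's code or the actual fix. It assumes the configured value is a comma-separated list (as in the `"Download Ninja,*"` example) and splits only on commas, so a name such as `Download Ninja` stays intact. The names `AgentNameMatcher`, `parseAgentList` and `matchesAny` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not taken from Nutch or crawler-commons.
final class AgentNameMatcher {
    private AgentNameMatcher() {}

    /** Splits a comma-separated agent list, keeping spaces inside each name. */
    static List<String> parseAgentList(String configured) {
        List<String> names = new ArrayList<>();
        for (String part : configured.split(",")) {
            String name = part.trim().toLowerCase(Locale.ROOT);
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    /** Returns true if a robots.txt User-agent value matches any configured name. */
    static boolean matchesAny(String robotsAgentValue, List<String> names) {
        String target = robotsAgentValue.trim().toLowerCase(Locale.ROOT);
        for (String name : names) {
            // "*" only matches the wildcard block; other names match as substrings.
            boolean hit = name.equals("*") ? target.equals("*") : target.contains(name);
            if (hit) {
                return true;
            }
        }
        return false;
    }
}
```

For example, `parseAgentList("Download Ninja,*")` yields two names, and a `User-agent: Download Ninja` line then matches the first of them.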