Java robots.txt parser with wildcard support

I'm looking for a robots.txt parser in Java that supports the same pattern-matching rules as Googlebot.

I've found some libraries that parse robots.txt files, but none of them supports Googlebot-style pattern matching:

  • Heritrix (there is an open issue on this subject)
  • Crawler4j (looks like the same implementation as Heritrix)
  • jrobotx

Does anyone know of a Java library that can do this?

Asked Aug 30 '11 by clement


People also ask

What is a wildcard in robots.txt?

While typical robots.txt rules prevent crawling of the pages in a directory or of a specific URL, wildcards in your robots.txt file let you block search engines from accessing content based on patterns in URLs, such as a parameter or the repetition of a character.
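For instance, a couple of Googlebot-style wildcard rules could look like this (the paths and the sessionid parameter name are made up for illustration):

    User-agent: Googlebot
    Disallow: /*?sessionid=
    Disallow: /*.pdf$

Here * matches any sequence of characters and $ anchors the pattern to the end of the URL, so the first rule blocks any URL containing a sessionid parameter and the second blocks URLs ending in .pdf.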

What does User-agent: * mean in robots.txt?

A robots.txt file consists of one or more blocks of directives, each starting with a user-agent line. The "user-agent" is the name of the specific spider the block addresses. You can either have one block for all search engines, using a wildcard for the user-agent, or separate blocks for particular search engines.
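For illustration (the paths are made up), a file with one block for every crawler and one block for a specific spider might look like:

    # applies to all crawlers
    User-agent: *
    Disallow: /private/

    # applies only to Googlebot
    User-agent: Googlebot
    Disallow: /no-google/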

Does Google respect robots.txt?

Google officially announced that Googlebot will no longer obey a robots.txt directive related to indexing. Publishers relying on the robots.txt noindex directive had until September 1, 2019 to remove it and begin using an alternative.

What should be disallowed in robots.txt?

The Disallow directive in robots.txt tells search engines not to access certain files, pages, or sections of your website.


1 Answer

Nutch seems to use crawler-commons in combination with some custom code (see RobotsRulesParser.java). I'm not sure of the current state of affairs, though.
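For what it's worth, here is a minimal sketch of checking URLs against a robots.txt with crawler-commons' SimpleRobotRulesParser. The robots.txt content, URLs, and agent name are invented for the example, and the exact method signature may differ between crawler-commons versions:

    import java.nio.charset.StandardCharsets;

    import crawlercommons.robots.BaseRobotRules;
    import crawlercommons.robots.SimpleRobotRulesParser;

    public class RobotsCheck {
        public static void main(String[] args) {
            // Made-up robots.txt content using a Googlebot-style wildcard rule
            String robotsTxt =
                    "User-agent: *\n"
                    + "Disallow: /*?sessionid=\n"
                    + "Disallow: /private/\n";

            SimpleRobotRulesParser parser = new SimpleRobotRulesParser();

            // parseContent(robotsTxtUrl, content, contentType, robotNames)
            BaseRobotRules rules = parser.parseContent(
                    "http://example.com/robots.txt",
                    robotsTxt.getBytes(StandardCharsets.UTF_8),
                    "text/plain",
                    "mycrawler");

            // Should be blocked by the wildcard rule if the parser applies it
            System.out.println(rules.isAllowed("http://example.com/page?sessionid=123"));
            // Should be allowed
            System.out.println(rules.isAllowed("http://example.com/page"));
        }
    }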

In particular, issue NUTCH-1455 looks quite relevant to your needs:

If the user-agent name(s) configured in http.robots.agents contain spaces, they are not matched even if the name is exactly contained in the robots.txt: http.robots.agents = "Download Ninja,*"

Perhaps it's worth trying/patching/submitting the fix :)

Answered Oct 04 '22 by aldrinleal