I am currently using jsoup in an application to parse and analyse web pages, but I want to make sure that I adhere to the robots.txt rules and only visit pages which are allowed.
I am pretty sure that jsoup is not made for this; it is all about web scraping and parsing. So I planned to have a function/module which reads the robots.txt of the domain/site and identifies whether the URL I am going to visit is allowed or not.
I did some research and found the following, but I am not sure about these, so it would be great if someone who has done the same kind of project involving robots.txt parsing could share their thoughts and ideas.
http://sourceforge.net/projects/jrobotx/
https://code.google.com/p/crawler-commons/
http://code.google.com/p/crowl/source/browse/trunk/Crow/src/org/crow/base/Robotstxt.java?r=12
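Whichever library is used, robots.txt always lives at the root of the host, so the lookup URL can be derived from any page URL before fetching it. A minimal stdlib-only sketch (the class and method names here are my own, for illustration):

```java
import java.net.URI;

public class RobotsLocator {
    // Derive the robots.txt URL for the host serving a given page URL:
    // scheme + host (+ explicit port, if any) + "/robots.txt".
    public static String robotsTxtUrl(String pageUrl) {
        URI u = URI.create(pageUrl);
        String port = (u.getPort() > -1) ? ":" + u.getPort() : "";
        return u.getScheme() + "://" + u.getHost() + port + "/robots.txt";
    }

    public static void main(String[] args) {
        System.out.println(robotsTxtUrl("https://example.com:8080/some/page.html"));
        // prints https://example.com:8080/robots.txt
    }
}
```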
A late answer just in case you - or someone else - are still looking for a way to do this. I am using https://code.google.com/p/crawler-commons/ in version 0.2 and it seems to work well. Here is a simplified example from the code I use:
String USER_AGENT = "WhateverBot";
String url = "http://www.....com/";
URL urlObj = new URL(url);
String hostId = urlObj.getProtocol() + "://" + urlObj.getHost()
        + (urlObj.getPort() > -1 ? ":" + urlObj.getPort() : "");
// cache the parsed rules per host so robots.txt is only fetched once per site
Map<String, BaseRobotRules> robotsTxtRules = new HashMap<String, BaseRobotRules>();
BaseRobotRules rules = robotsTxtRules.get(hostId);
if (rules == null) {
    // httpclient is an already-configured Apache HttpClient instance
    HttpGet httpget = new HttpGet(hostId + "/robots.txt");
    HttpContext context = new BasicHttpContext();
    HttpResponse response = httpclient.execute(httpget, context);
    if (response.getStatusLine() != null && response.getStatusLine().getStatusCode() == 404) {
        // no robots.txt found -> everything is allowed
        rules = new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
        // consume entity to deallocate connection
        EntityUtils.consumeQuietly(response.getEntity());
    } else {
        BufferedHttpEntity entity = new BufferedHttpEntity(response.getEntity());
        SimpleRobotRulesParser robotParser = new SimpleRobotRulesParser();
        rules = robotParser.parseContent(hostId, IOUtils.toByteArray(entity.getContent()),
                "text/plain", USER_AGENT);
    }
    robotsTxtRules.put(hostId, rules);
}
boolean urlAllowed = rules.isAllowed(url);
Obviously this is not related to jsoup in any way; it just checks whether a given URL is allowed to be crawled for a certain USER_AGENT. For fetching the robots.txt I use Apache HttpClient in version 4.2.1, but this could be replaced by java.net classes as well.
Please note that this code only checks for allowance or disallowance and does not consider other robots.txt features like "Crawl-delay". But as crawler-commons provides this feature as well, it can easily be added to the code above.
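Honouring a crawl delay mostly comes down to throttling requests per host on the crawler side. A stdlib-only sketch of such a per-host throttle, assuming the delay value itself is obtained elsewhere (e.g. from the parsed rules); the class and method names are my own:

```java
import java.util.HashMap;
import java.util.Map;

public class PolitenessDelay {
    // Timestamp of the last fetch per host key (e.g. "http://example.com").
    private final Map<String, Long> lastFetch = new HashMap<>();

    // Block until at least delayMillis has passed since the last fetch
    // from the same host, then record the current fetch time.
    public synchronized void awaitTurn(String hostId, long delayMillis) {
        Long last = lastFetch.get(hostId);
        if (last != null) {
            long wait = last + delayMillis - System.currentTimeMillis();
            if (wait > 0) {
                try {
                    Thread.sleep(wait);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
        lastFetch.put(hostId, System.currentTimeMillis());
    }

    public static void main(String[] args) {
        PolitenessDelay limiter = new PolitenessDelay();
        long start = System.currentTimeMillis();
        limiter.awaitTurn("http://example.com", 200);
        limiter.awaitTurn("http://example.com", 200); // waits ~200 ms
        System.out.println("elapsed >= 200 ms: "
                + (System.currentTimeMillis() - start >= 200));
    }
}
```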
The above didn't work for me, but I managed to put this together. It's the first time I've written Java in four years, so I'm sure this can be improved.
private static final String DISALLOW = "disallow:"; // matched case-insensitively below

public static boolean robotSafe(URL url)
{
    String strHost = url.getHost();
    String strRobot = "http://" + strHost + "/robots.txt";
    URL urlRobot;
    try {
        urlRobot = new URL(strRobot);
    } catch (MalformedURLException e) {
        // something weird is happening, so don't trust it
        return false;
    }
    String strCommands;
    try (InputStream urlRobotStream = urlRobot.openStream())
    {
        // read the whole robots.txt into a string
        StringBuilder sb = new StringBuilder();
        byte[] b = new byte[1000];
        int numRead;
        while ((numRead = urlRobotStream.read(b)) != -1) {
            sb.append(new String(b, 0, numRead));
        }
        strCommands = sb.toString();
    }
    catch (IOException e)
    {
        return true; // if there is no robots.txt file, it is OK to search
    }
    if (strCommands.toLowerCase().contains(DISALLOW)) // if there are no "Disallow" lines, nothing is blocked
    {
        String[] split = strCommands.split("\n");
        ArrayList<RobotRule> robotRules = new ArrayList<>();
        String mostRecentUserAgent = null;
        for (String s : split)
        {
            String line = s.trim();
            if (line.toLowerCase().startsWith("user-agent"))
            {
                int start = line.indexOf(":") + 1;
                mostRecentUserAgent = line.substring(start).trim();
            }
            else if (line.toLowerCase().startsWith(DISALLOW))
            {
                if (mostRecentUserAgent != null)
                {
                    RobotRule r = new RobotRule();
                    r.userAgent = mostRecentUserAgent;
                    int start = line.indexOf(":") + 1;
                    r.rule = line.substring(start).trim();
                    robotRules.add(r);
                }
            }
        }
        for (RobotRule robotRule : robotRules)
        {
            String path = url.getPath();
            if (robotRule.rule.isEmpty()) return true; // a blank Disallow allows everything
            if (robotRule.rule.equals("/")) return false; // "/" allows nothing
            if (robotRule.rule.length() <= path.length())
            {
                String pathCompare = path.substring(0, robotRule.rule.length());
                if (pathCompare.equals(robotRule.rule)) return false;
            }
        }
    }
    return true;
}
And you will need the helper class:
/**
*
* @author Namhost.com
*/
public class RobotRule
{
public String userAgent;
public String rule;
RobotRule() {
}
@Override public String toString()
{
StringBuilder result = new StringBuilder();
String NEW_LINE = System.getProperty("line.separator");
result.append(this.getClass().getName() + " Object {" + NEW_LINE);
result.append(" userAgent: " + this.userAgent + NEW_LINE);
result.append(" rule: " + this.rule + NEW_LINE);
result.append("}");
return result.toString();
}
}
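To see what the prefix comparison in robotSafe actually blocks, the core test can be isolated and tried on sample paths without any network access (class and method names here are my own):

```java
public class PrefixMatchDemo {
    // The same rule test used in robotSafe, isolated for experimentation:
    // returns true if the given Disallow rule blocks the given URL path.
    public static boolean blockedBy(String rule, String path) {
        if (rule.isEmpty()) return false;      // a blank Disallow blocks nothing
        if (rule.equals("/")) return true;     // "/" blocks the whole site
        return path.startsWith(rule);          // otherwise a plain prefix match
    }

    public static void main(String[] args) {
        System.out.println(blockedBy("/private/", "/private/data.html")); // true
        System.out.println(blockedBy("/private/", "/public/index.html")); // false
        System.out.println(blockedBy("", "/anything"));                   // false
        System.out.println(blockedBy("/", "/anything"));                  // true
    }
}
```

Note that this is still only a prefix match; wildcard rules like `Disallow: /*.pdf`, which some sites use, would need a library such as crawler-commons.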