I am currently using jsoup in an application to parse and analyse web pages, but I want to make sure that I adhere to the robots.txt rules and only visit pages which are allowed.
I am pretty sure that jsoup is not made for this; it is all about web scraping and parsing. So I planned to have a function/module which reads the robots.txt of the domain/site and identifies whether the URL I am going to visit is allowed or not.
I did some research and found the following, but I am not sure about these, so it would be great if someone who has done the same kind of project involving robots.txt parsing could share their thoughts and ideas.
http://sourceforge.net/projects/jrobotx/
https://code.google.com/p/crawler-commons/
http://code.google.com/p/crowl/source/browse/trunk/Crow/src/org/crow/base/Robotstxt.java?r=12
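Whichever library is used, robots.txt always lives at the root of the host, so the lookup URL can be derived from any page URL before fetching it. A minimal stdlib-only sketch (the class and method names here are my own, for illustration):

```java
import java.net.URI;

public class RobotsLocator {
    // Derive the robots.txt URL for the host serving a given page URL:
    // scheme + host (+ explicit port, if any) + "/robots.txt".
    public static String robotsTxtUrl(String pageUrl) {
        URI u = URI.create(pageUrl);
        String port = (u.getPort() > -1) ? ":" + u.getPort() : "";
        return u.getScheme() + "://" + u.getHost() + port + "/robots.txt";
    }

    public static void main(String[] args) {
        System.out.println(robotsTxtUrl("https://example.com:8080/some/page.html"));
        // prints https://example.com:8080/robots.txt
    }
}
```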
A late answer just in case you - or someone else - are still looking for a way to do this. I am using https://code.google.com/p/crawler-commons/ in version 0.2 and it seems to work well. Here is a simplified example from the code I use:
String USER_AGENT = "WhateverBot";
String url = "http://www.....com/";
URL urlObj = new URL(url);
String hostId = urlObj.getProtocol() + "://" + urlObj.getHost()
        + (urlObj.getPort() > -1 ? ":" + urlObj.getPort() : "");
// cache the parsed rules per host so robots.txt is only fetched once per site
Map<String, BaseRobotRules> robotsTxtRules = new HashMap<String, BaseRobotRules>();
BaseRobotRules rules = robotsTxtRules.get(hostId);
if (rules == null) {
    // httpclient is an already-configured Apache HttpClient instance
    HttpGet httpget = new HttpGet(hostId + "/robots.txt");
    HttpContext context = new BasicHttpContext();
    HttpResponse response = httpclient.execute(httpget, context);
    if (response.getStatusLine() != null && response.getStatusLine().getStatusCode() == 404) {
        // no robots.txt found -> everything is allowed
        rules = new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
        // consume entity to deallocate connection
        EntityUtils.consumeQuietly(response.getEntity());
    } else {
        BufferedHttpEntity entity = new BufferedHttpEntity(response.getEntity());
        SimpleRobotRulesParser robotParser = new SimpleRobotRulesParser();
        rules = robotParser.parseContent(hostId, IOUtils.toByteArray(entity.getContent()),
                "text/plain", USER_AGENT);
    }
    robotsTxtRules.put(hostId, rules);
}
boolean urlAllowed = rules.isAllowed(url);
Obviously this is not related to jsoup in any way; it just checks whether a given URL is allowed to be crawled for a certain USER_AGENT. For fetching the robots.txt I use Apache HttpClient in version 4.2.1, but this could be replaced by java.net classes as well.
Please note that this code only checks for allowance or disallowance and does not consider other robots.txt features like "Crawl-delay". But as crawler-commons provides this feature as well, it can easily be added to the code above.
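Honouring a crawl delay mostly comes down to throttling requests per host on the crawler side. A stdlib-only sketch of such a per-host throttle, assuming the delay value itself is obtained elsewhere (e.g. from the parsed rules); the class and method names are my own:

```java
import java.util.HashMap;
import java.util.Map;

public class PolitenessDelay {
    // Timestamp of the last fetch per host key (e.g. "http://example.com").
    private final Map<String, Long> lastFetch = new HashMap<>();

    // Block until at least delayMillis has passed since the last fetch
    // from the same host, then record the current fetch time.
    public synchronized void awaitTurn(String hostId, long delayMillis) {
        Long last = lastFetch.get(hostId);
        if (last != null) {
            long wait = last + delayMillis - System.currentTimeMillis();
            if (wait > 0) {
                try {
                    Thread.sleep(wait);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
        lastFetch.put(hostId, System.currentTimeMillis());
    }

    public static void main(String[] args) {
        PolitenessDelay limiter = new PolitenessDelay();
        long start = System.currentTimeMillis();
        limiter.awaitTurn("http://example.com", 200);
        limiter.awaitTurn("http://example.com", 200); // waits ~200 ms
        System.out.println("elapsed >= 200 ms: "
                + (System.currentTimeMillis() - start >= 200));
    }
}
```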
The above didn't work for me, but I managed to put this together. It's the first time I've written Java in four years, so I'm sure this can be improved.
private static final String DISALLOW = "disallow:"; // matched case-insensitively below

public static boolean robotSafe(URL url)
{
    String strHost = url.getHost();
    String strRobot = "http://" + strHost + "/robots.txt";
    URL urlRobot;
    try {
        urlRobot = new URL(strRobot);
    } catch (MalformedURLException e) {
        // something weird is happening, so don't trust it
        return false;
    }
    String strCommands;
    try (InputStream urlRobotStream = urlRobot.openStream())
    {
        // read the whole robots.txt into a string
        StringBuilder sb = new StringBuilder();
        byte[] b = new byte[1000];
        int numRead;
        while ((numRead = urlRobotStream.read(b)) != -1) {
            sb.append(new String(b, 0, numRead));
        }
        strCommands = sb.toString();
    }
    catch (IOException e)
    {
        return true; // if there is no robots.txt file, it is OK to search
    }
    if (strCommands.toLowerCase().contains(DISALLOW)) // if there are no "Disallow" lines, nothing is blocked
    {
        String[] split = strCommands.split("\n");
        ArrayList<RobotRule> robotRules = new ArrayList<>();
        String mostRecentUserAgent = null;
        for (String s : split)
        {
            String line = s.trim();
            if (line.toLowerCase().startsWith("user-agent"))
            {
                int start = line.indexOf(":") + 1;
                mostRecentUserAgent = line.substring(start).trim();
            }
            else if (line.toLowerCase().startsWith(DISALLOW))
            {
                if (mostRecentUserAgent != null)
                {
                    RobotRule r = new RobotRule();
                    r.userAgent = mostRecentUserAgent;
                    int start = line.indexOf(":") + 1;
                    r.rule = line.substring(start).trim();
                    robotRules.add(r);
                }
            }
        }
        for (RobotRule robotRule : robotRules)
        {
            String path = url.getPath();
            if (robotRule.rule.isEmpty()) return true; // a blank Disallow allows everything
            if (robotRule.rule.equals("/")) return false; // "/" allows nothing
            if (robotRule.rule.length() <= path.length())
            {
                String pathCompare = path.substring(0, robotRule.rule.length());
                if (pathCompare.equals(robotRule.rule)) return false;
            }
        }
    }
    return true;
}
And you will need the helper class:
/**
*
* @author Namhost.com
*/
public class RobotRule
{
public String userAgent;
public String rule;
RobotRule() {
}
@Override public String toString()
{
StringBuilder result = new StringBuilder();
String NEW_LINE = System.getProperty("line.separator");
result.append(this.getClass().getName() + " Object {" + NEW_LINE);
result.append(" userAgent: " + this.userAgent + NEW_LINE);
result.append(" rule: " + this.rule + NEW_LINE);
result.append("}");
return result.toString();
}
}
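To see what the prefix comparison in robotSafe actually blocks, the core test can be isolated and tried on sample paths without any network access (class and method names here are my own):

```java
public class PrefixMatchDemo {
    // The same rule test used in robotSafe, isolated for experimentation:
    // returns true if the given Disallow rule blocks the given URL path.
    public static boolean blockedBy(String rule, String path) {
        if (rule.isEmpty()) return false;      // a blank Disallow blocks nothing
        if (rule.equals("/")) return true;     // "/" blocks the whole site
        return path.startsWith(rule);          // otherwise a plain prefix match
    }

    public static void main(String[] args) {
        System.out.println(blockedBy("/private/", "/private/data.html")); // true
        System.out.println(blockedBy("/private/", "/public/index.html")); // false
        System.out.println(blockedBy("", "/anything"));                   // false
        System.out.println(blockedBy("/", "/anything"));                  // true
    }
}
```

Note that this is still only a prefix match; wildcard rules like `Disallow: /*.pdf`, which some sites use, would need a library such as crawler-commons.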