I have a method that takes a URL and finds all the links on that page. However, I am not sure it is picking up only links, because when I check whether the links work, some of them look strange. For example, if I check the links at www.google.com I get 6 broken links that return no HTTP status code; instead the error says there is 'no protocol' for that link. I wouldn't imagine Google would have any broken links on its homepage. An example of one of the broken links is: /preferences?hl=en. I can't see where this link is on the Google homepage. Am I extracting only links, or is it possible I am extracting code that is not supposed to be a link?
Here is the method that checks the URL for links:
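import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.EditorKit;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public static List<String> getLinks(String uriStr) {
    List<String> result = new ArrayList<>();
    try {
        System.out.println("in the getlinks try");
        URL url = new URI(uriStr).toURL();
        URLConnection conn = url.openConnection();
        // Create a reader on the HTML content; it is closed when the block exits
        try (Reader rd = new InputStreamReader(conn.getInputStream())) {
            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
            kit.read(rd, doc, 0);
            // Find all the A elements in the HTML document
            HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
            while (it.isValid()) {
                AttributeSet s = it.getAttributes();
                // The href is stored exactly as written in the page (it may be relative)
                String link = (String) s.getAttribute(HTML.Attribute.HREF);
                if (link != null) {
                    // Add the link to the result list
                    System.out.println(link);
                    result.add(link);
                }
                it.next();
            }
        }
    } catch (IOException | URISyntaxException | BadLocationException e) {
        // Assumed handling: report the failure and return whatever was collected
        e.printStackTrace();
    }
    return result;
}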
There is nothing wrong with the link that you are getting back.
Looking at your code, you are extracting the href attribute, which in the case of your example comes from the element:
<a class=gbmt href="/preferences?hl=en">Search settings</a>
(You can see this link if you click "Settings" at the bottom right of the page; a list should pop up with several links.)
As you can see, the href attribute contains only /preferences?hl=en, which makes it a relative link. The full URL is found by resolving the href against the address of the page you are currently on; because this href starts with /, it is appended to the site root. In this case:
http://www.google.com/preferences?hl=en
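You just need to tweak your code to resolve each relative href against the URL passed to your method before you check it. That is most likely also where the 'no protocol' message comes from: a relative link on its own has no scheme such as http://, so it cannot be turned into a URL.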
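As a rough sketch of that tweak, assuming you are happy to let java.net.URI do the resolution: the class and method names below (LinkResolver, resolve, resolveAll) are made up for illustration and are not part of your code. The idea is to pass each extracted href through resolve together with the page URL before checking it.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class LinkResolver {

    // Hypothetical helper: turns an href found on pageUrl into an absolute URL string.
    public static String resolve(String pageUrl, String href) {
        URI base = URI.create(pageUrl);
        // Give a bare host such as http://www.google.com a root path first,
        // so that path-relative hrefs like "about.html" resolve cleanly.
        if (base.getPath() == null || base.getPath().isEmpty()) {
            base = base.resolve("/");
        }
        // An href that is already absolute is returned unchanged by resolve().
        return base.resolve(href).toString();
    }

    // Hypothetical helper: resolves every href, skipping ones that are not valid URIs.
    public static List<String> resolveAll(String pageUrl, List<String> hrefs) {
        List<String> result = new ArrayList<>();
        for (String href : hrefs) {
            try {
                result.add(resolve(pageUrl, href));
            } catch (IllegalArgumentException e) {
                // URI.create rejects malformed input (for example, unescaped spaces)
                System.err.println("Skipping unparsable href: " + href);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://www.google.com", "/preferences?hl=en"));
    }
}
```

With your example, resolve("http://www.google.com", "/preferences?hl=en") gives http://www.google.com/preferences?hl=en, which you can then check like any other absolute link.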