
Am I only extracting links from a URL with this Java code?

I have a method that takes a URL and finds all the links on that page. However, I am not sure it is extracting only links, because when I check whether the links work, some of them look strange. For example, when I check the links on www.google.com I get 6 broken links that return no HTTP status code; instead the error says there is 'no protocol' for the link. I wouldn't expect Google to have any broken links on its homepage. An example of one of the broken links is: /preferences?hl=en. I can't see where this link is on the Google homepage. Am I checking just links, or is it possible I am extracting something that is not supposed to be a link?

Here is the method that checks the URL for links:

import java.io.*;
import java.net.*;
import java.util.*;
import javax.swing.text.*;
import javax.swing.text.html.*;

public static List<String> getLinks(String uriStr) {
    List<String> result = new ArrayList<>();
    try {
        URL url = new URI(uriStr).toURL();
        URLConnection conn = url.openConnection();

        // Create a reader on the HTML content (closed automatically)
        try (Reader rd = new InputStreamReader(conn.getInputStream())) {
            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
            kit.read(rd, doc, 0);

            // Find all the A elements in the HTML document
            HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.A);
            while (it.isValid()) {
                AttributeSet s = it.getAttributes();
                // The href comes back exactly as written in the page, so it may be relative
                String link = (String) s.getAttribute(HTML.Attribute.HREF);
                if (link != null) {
                    // Add the link to the result list
                    System.out.println(link);
                    result.add(link);
                }
                it.next();
            }
        }
    } catch (IOException | URISyntaxException | BadLocationException e) {
        // Assumed error handling (the posted snippet ends before its catch block)
        e.printStackTrace();
    }
    return result;
}
user1835504 asked Nov 12 '22 06:11


1 Answer

There is nothing wrong with the link that you are getting back.

Looking at your code, you are extracting the href attribute, which in your example comes from this element:

<a  class=gbmt href="/preferences?hl=en">Search settings</a>

(You can see this link if you click "Settings" at the bottom right of the page; a list should pop up with several links.)

As you can see, the href attribute only contains /preferences?hl=en, which makes it a relative link. The full URL is obtained by resolving the href against the address of the page you are currently on; because this href starts with /, it is appended to the site root. In this case:

http://www.google.com/preferences?hl=en
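The 'no protocol' message from the question is consistent with this: an href such as /preferences?hl=en carries no scheme (http, https, ...), so it cannot be opened on its own. A minimal sketch of how the extracted hrefs could be told apart, assuming a hypothetical helper class named HrefKind (not part of the original code):

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper; the class and method names are illustrative only.
class HrefKind {

    // An href is relative when it has no scheme component.
    // java.net.URI.isAbsolute() returns true only if a scheme is present.
    static boolean isRelative(String href) throws URISyntaxException {
        return !new URI(href).isAbsolute();
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(isRelative("/preferences?hl=en"));                      // true
        System.out.println(isRelative("http://www.google.com/preferences?hl=en")); // false
    }
}
```

Note that new URI(...) throws URISyntaxException for hrefs containing characters that are illegal in a URI (an unencoded space, for example), so real pages may need extra handling.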

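You just need to tweak your code to resolve each href against the URL passed to your method when the link is relative. Plain string concatenation only works for hrefs that start with / combined with a page URL that has no path of its own.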
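One possible way to make that tweak is to let java.net.URI do the resolution instead of concatenating strings. The sketch below assumes a hypothetical helper class named LinkResolver; the page URL is given with a trailing slash so that the base URI has a non-empty path.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper; the class and method names are illustrative only.
class LinkResolver {

    // Resolve an href against the URL of the page it was found on.
    // Absolute hrefs are returned unchanged; relative ones are combined with the base.
    static String toAbsolute(String pageUrl, String href) throws URISyntaxException {
        URI base = new URI(pageUrl);
        return base.resolve(new URI(href)).toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // prints http://www.google.com/preferences?hl=en
        System.out.println(toAbsolute("http://www.google.com/", "/preferences?hl=en"));
    }
}
```

Inside getLinks, the call would be something like result.add(LinkResolver.toAbsolute(uriStr, link)), with URISyntaxException handled for hrefs that are not valid URIs.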

Francisco Paulo answered Nov 14 '22 21:11
