
Open a connection with Jsoup, get status code and parse document

I'm creating a class using jsoup that will do the following:

  1. The constructor opens a connection to a url.
  2. I have a method that checks the status of the page, e.g. 200, 404, etc.
  3. I have a method that parses the page and returns a list of URLs.

Below is a rough version of what I am trying to do. Note that it's very rough, as I've been trying a lot of different things:

public class ParsePage {
private String path;
Connection.Response response = null;

private ParsePage(String langLocale){
    try {
        response = Jsoup.connect(path)
                .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                .timeout(10000)
                .execute();
    } catch (IOException e) {
        System.out.println("io - "+e);
    }
}

public int getSitemapStatus(){
    int statusCode = response.statusCode();
    return statusCode;
}

public ArrayList<String> getUrls(){
    ArrayList<String> urls = new ArrayList<String>();

 }
}

As you can see, I can get the page status, but I don't know how to get the document to parse from the connection already opened in the constructor. I tried using:

Document doc = connection.get();

But that's a no-go. Any suggestions, or better ways to go about this?

Peck3277 asked May 09 '12 15:05




1 Answer

As stated in the jsoup documentation for the Connection.Response type, there is a parse() method that parses the response's body as a Document and returns it. Once you have the Document, you can do whatever you want with it. Note that parse() can throw an IOException, so the caller must catch or declare it.

For example, see the implementation of getUrls() here:

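import java.io.IOException;
import java.util.ArrayList;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParsePage {
   // As in the question, path still has to be assigned before connecting.
   private String path;
   Connection.Response response = null;

   private ParsePage(String langLocale){
      try {
         // execute() sends the request and keeps the whole response (status, headers, body)
         response = Jsoup.connect(path)
            .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
            .timeout(10000)
            .execute();
      } catch (IOException e) {
         System.out.println("io - "+e);
      }
   }

   public int getSitemapStatus() {
      return response.statusCode();
   }

   // parse() declares IOException, so it is propagated to the caller here
   public ArrayList<String> getUrls() throws IOException {
      ArrayList<String> urls = new ArrayList<String>();
      // reuse the response fetched in the constructor instead of requesting the page again
      Document doc = response.parse();
      // do whatever you want, for example retrieving the <url> entries from the sitemap
      for (Element url : doc.select("url")) {
         urls.add(url.select("loc").text());
      }
      return urls;
   }
}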
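
As a possible variation on the getUrls() approach above, here is a minimal sketch rather than a definitive implementation. It assumes the page is a sitemap in XML, so it switches jsoup to its XML parser instead of the default HTML parser. It also sets ignoreHttpErrors(true): by default execute() throws an exception for responses such as 404, which would prevent reading the status code at all. The class name SitemapSketch, the helper names fetch and extractUrls, and the example.com URLs are hypothetical and not part of the original question or answer.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class SitemapSketch {

    // Hypothetical helper: fetch once and keep the Response, so the same
    // object can serve both statusCode() and parse().
    static Connection.Response fetch(String url) throws IOException {
        return Jsoup.connect(url)
                .timeout(10000)
                .ignoreHttpErrors(true)      // return 404/500 responses instead of throwing
                .parser(Parser.xmlParser())  // response.parse() then treats the body as XML
                .execute();
    }

    // Collect the text of every <loc> that is a direct child of a <url> element.
    static List<String> extractUrls(Document doc) {
        List<String> urls = new ArrayList<>();
        for (Element loc : doc.select("url > loc")) {
            urls.add(loc.text());
        }
        return urls;
    }

    public static void main(String[] args) {
        // Offline demonstration: parse an in-memory sitemap instead of fetching one.
        String xml = "<urlset><url><loc>http://example.com/a</loc></url>"
                   + "<url><loc>http://example.com/b</loc></url></urlset>";
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        System.out.println(extractUrls(doc));
    }
}
```

With a live URL, the flow would be to call fetch(url), check response.statusCode(), and only call extractUrls(response.parse()) when the status is 200.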
Alex answered Oct 06 '22 00:10