JSoup parsing invalid HTML with unclosed tags

Question

Using Jsoup, up to and including the latest release (1.7.2), there is a bug when parsing invalid HTML with unclosed tags.

Example:

String tmp = "<a href='www.google.com'>Link<p>Error link</a>";
Jsoup.parse(tmp);

The Document it generates is:

<html>
 <head></head>
 <body>
  <a href="www.google.com">Link</a>
  <p><a>Error link</a></p>
 </body>
</html>

Browsers would generate something like:

<html>
 <head></head>
 <body>
  <a href="www.google.com">Link</a>
  <p><a href="www.google.com">Error link</a></p>
 </body>
</html>

Jsoup should behave the way browsers do, or else follow the source as written.

Is there any solution? I looked through the API and didn't find anything.

Jonathan Hedley · Accepted Answer

The correct behavior is to act as browsers do when parsing this invalid HTML. Thanks for filing this bug. I've fixed the issue that was preventing the adoption agency algorithm from keeping the original attributes in the new node. It will be available in 1.7.3, or you can build from head now.
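To make the attribute issue concrete, here is a toy model of the step in question. It is a sketch only and is not Jsoup's source: the `El` type, `reopenInside`, and the `copyAttrs` flag are hypothetical names invented for illustration. It assumes the simplified situation from the question, where the `<a>` is re-created inside the `<p>` once `</a>` is seen, and shows how the output differs depending on whether the new node receives the original attributes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AdoptionSketch {

    // Hypothetical minimal element type; NOT Jsoup's Element class.
    static final class El {
        final String tag;
        final Map<String, String> attrs = new LinkedHashMap<>();
        final List<Object> children = new ArrayList<>(); // El or String (text)

        El(String tag) { this.tag = tag; }

        String html() {
            StringBuilder sb = new StringBuilder("<").append(tag);
            for (Map.Entry<String, String> e : attrs.entrySet()) {
                sb.append(' ').append(e.getKey())
                  .append("=\"").append(e.getValue()).append('"');
            }
            sb.append('>');
            for (Object c : children) {
                sb.append(c instanceof El ? ((El) c).html() : c.toString());
            }
            return sb.append("</").append(tag).append('>').toString();
        }
    }

    // Re-creates the formatting element inside the block and moves the
    // block's children into it. copyAttrs = false models the reported
    // behavior; copyAttrs = true models the behavior after the fix.
    static El reopenInside(El formatting, El block, boolean copyAttrs) {
        El clone = new El(formatting.tag);
        if (copyAttrs) {
            clone.attrs.putAll(formatting.attrs);
        }
        clone.children.addAll(block.children);
        block.children.clear();
        block.children.add(clone);
        return clone;
    }

    // Builds the two body-level nodes from the question's example by hand.
    static String demo(boolean copyAttrs) {
        El a = new El("a");
        a.attrs.put("href", "www.google.com");
        a.children.add("Link");

        El p = new El("p");
        p.children.add("Error link");

        reopenInside(a, p, copyAttrs);
        return a.html() + p.html();
    }

    public static void main(String[] args) {
        System.out.println(demo(false)); // second <a> has no href
        System.out.println(demo(true));  // second <a> keeps href
    }
}
```

With `copyAttrs` set to `false` the second link loses its `href`, matching the output in the question; with `true` it matches what browsers produce.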

user2767013 · Answer

If your goal is to get the same source that browsers generate, you could use Selenium to load the page and then pass the result to Jsoup for parsing. Selenium has to open a real browser, but it can launch it automatically. Code like this:

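import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

// The class name is arbitrary.
public class BrowserSource {

    public static void main(String[] args) {

        // To use Chrome instead, uncomment these two lines and drop the Firefox line.
        //System.setProperty("webdriver.chrome.driver", "./chromedriver.exe");
        //WebDriver driver = new ChromeDriver();
        WebDriver driver = new FirefoxDriver();
        driver.get("file:///C:/Users/jgong/Desktop/a.html");

        // The browser has already repaired the markup at this point.
        String html = driver.getPageSource();
        System.out.println(html);
        driver.quit();

        // Jsoup now parses the browser-normalized source.
        Document doc = Jsoup.parse(html);
        System.out.println(doc.html());
    }
}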

and a.html is:

<html><head></head><body><a href="www.google.com">Link<p>Error link</a></body></html>

and the result is what you wanted:

<html><head></head> <body> <a href="www.google.com">Link</a><p><a href="www.google.com">Error link</a> </p></body></html>

Tags:

java

html-parsing

web-crawler

jsoup

Javier Salinas
