
JSoup parsing invalid HTML with unclosed tags

With JSoup, up to and including the latest release (1.7.2), there is a bug when parsing invalid HTML with unclosed tags.

Example:

String tmp = "<a href='www.google.com'>Link<p>Error link</a>";
Jsoup.parse(tmp);

The Document it generates is:

<html>
 <head></head>
 <body>
  <a href="www.google.com">Link</a>
  <p><a>Error link</a></p>
 </body>
</html>

Browsers would generate something like:

<html>
 <head></head>
 <body>
  <a href="www.google.com">Link</a>
  <p><a href="www.google.com">Error link</a></p>
 </body>
</html>

Jsoup should behave as browsers do, or keep the markup as it appears in the source.

Is there any solution? I looked through the API and didn't find anything.

Javier Salinas asked Apr 04 '13 14:04


2 Answers

The correct behavior is to act as browsers do when parsing this invalid HTML. Thanks for filing this bug. I've fixed the issue that was preventing the adoption agency algorithm from keeping the original attributes in the new node. The fix will be available in 1.7.3, or you can build from head now.
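
To check whether the jsoup version on your classpath already includes this fix, something like the following minimal sketch may help. It parses the snippet from the question and reads the href of the anchor that the parser creates inside the paragraph. The class name AdoptionAgencyCheck and the helper clonedHref are illustrative names, not part of jsoup, and the sketch assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AdoptionAgencyCheck {

    // Hypothetical helper: returns the href of the <a> element that the
    // parser places directly inside <p>, or an empty string if there is
    // no such element or it carries no href attribute.
    public static String clonedHref(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.select("p > a").first();
        return link == null ? "" : link.attr("href");
    }

    public static void main(String[] args) {
        // Same invalid markup as in the question: <a> is closed after <p> opens.
        String tmp = "<a href='www.google.com'>Link<p>Error link</a>";
        System.out.println(clonedHref(tmp));
    }
}
```

On a release that contains the fix, the printed value should be the original href; on 1.7.2 the second anchor has no attributes, so the helper would return an empty string.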

Jonathan Hedley answered Nov 14 '22 06:11


If your goal is to get the same markup that browsers generate, you could use Selenium to load the page and then pass the resulting source to Jsoup for parsing. Selenium has to open a real browser, but it can launch it automatically. For example:

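import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

// The class name is arbitrary; any name works.
public class BrowserSource {

    public static void main(String[] args) {
        // To use Chrome instead, point Selenium at chromedriver and swap the driver:
        //System.setProperty("webdriver.chrome.driver", "./chromedriver.exe");
        //WebDriver driver = new ChromeDriver();
        WebDriver driver = new FirefoxDriver();
        driver.get("file:///C:/Users/jgong/Desktop/a.html");

        // Page source as repaired by the browser's own HTML parser
        String html = driver.getPageSource();
        System.out.println(html);
        driver.quit();

        // Hand the browser-normalized markup to Jsoup
        Document doc = Jsoup.parse(html);
        System.out.println(doc.html());
    }
}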

and a.html is:

<html><head></head><body><a href="www.google.com">Link<p>Error link</a></body></html>

and the result is what you wanted:

<html><head></head> <body> <a href="www.google.com">Link</a><p><a href="www.google.com">Error link</a> </p></body></html>
user2767013 answered Nov 14 '22 05:11