With jsoup, up to and including the latest release (1.7.2), there is a bug when parsing invalid HTML that contains unclosed tags.
Example:
String tmp = "<a href='www.google.com'>Link<p>Error link</a>";
Jsoup.parse(tmp);
The Document that jsoup generates is:
<html>
<head></head>
<body>
<a href="www.google.com">Link</a>
<p><a>Error link</a></p>
</body>
</html>
A browser would generate something like:
<html>
<head></head>
<body>
<a href="www.google.com">Link</a>
<p><a href="www.google.com">Error link</a></p>
</body>
</html>
Jsoup should behave as browsers do, or at least keep what is in the source.
Is there a solution? Looking through the API, I didn't find anything.
The correct behavior is to act as browsers do when parsing this invalid HTML. Thanks for filing this bug. I've fixed the issue that was preventing the adoption agency algorithm from keeping the original attributes in the new node. The fix will be available in 1.7.3, or you can build from head now.
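To illustrate what "keeping the original attributes in the new node" means, here is a minimal sketch in plain Java. It is not jsoup's code: `SketchElement`, `AdoptionSketch`, and both `recreate...` methods are hypothetical names made up for this example, and the model ignores almost everything the real adoption agency algorithm does. It only contrasts a replacement element created from the tag name alone with one that also copies the attributes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, heavily simplified element model; NOT a jsoup class.
class SketchElement {
    final String tag;
    final Map<String, String> attributes = new LinkedHashMap<>();
    final List<Object> children = new ArrayList<>(); // SketchElement or String (text)

    SketchElement(String tag) { this.tag = tag; }

    SketchElement attr(String key, String value) { attributes.put(key, value); return this; }

    SketchElement append(Object child) { children.add(child); return this; }

    // Serializes the element, its attributes, and its children on one line.
    String html() {
        StringBuilder sb = new StringBuilder("<").append(tag);
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            sb.append(' ').append(e.getKey()).append("=\"").append(e.getValue()).append('"');
        }
        sb.append('>');
        for (Object child : children) {
            sb.append(child instanceof SketchElement ? ((SketchElement) child).html() : child.toString());
        }
        return sb.append("</").append(tag).append('>').toString();
    }
}

class AdoptionSketch {
    // Assumed shape of the bug: the replacement gets only the tag name.
    static SketchElement recreateWithoutAttributes(SketchElement formatting) {
        return new SketchElement(formatting.tag);
    }

    // Assumed shape of the fix: the replacement also copies the attributes.
    static SketchElement recreateWithAttributes(SketchElement formatting) {
        SketchElement copy = new SketchElement(formatting.tag);
        copy.attributes.putAll(formatting.attributes);
        return copy;
    }

    public static void main(String[] args) {
        SketchElement a = new SketchElement("a").attr("href", "www.google.com").append("Link");

        SketchElement buggy = new SketchElement("p")
                .append(recreateWithoutAttributes(a).append("Error link"));
        SketchElement fixed = new SketchElement("p")
                .append(recreateWithAttributes(a).append("Error link"));

        System.out.println(buggy.html()); // <p><a>Error link</a></p>
        System.out.println(fixed.html()); // <p><a href="www.google.com">Error link</a></p>
    }
}
```

The two printed lines correspond to the 1.7.2 output and the browser output shown in the question.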
If your goal is to get the source as a browser generates it, you could use Selenium to load the page and then pass the resulting page source to Jsoup to parse. Selenium has to open a real browser, but it can launch it automatically. For example:
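import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public static void main(String[] args) {
    // To use Chrome instead, point Selenium at chromedriver and create a ChromeDriver:
    //System.setProperty("webdriver.chrome.driver", "./chromedriver.exe");
    //WebDriver driver = new ChromeDriver();
    WebDriver driver = new FirefoxDriver();
    // Let the real browser load and repair the invalid HTML
    driver.get("file:///C:/Users/jgong/Desktop/a.html");
    String html = driver.getPageSource();
    System.out.println(html);
    driver.quit();
    // Parse the browser-repaired markup with Jsoup
    Document doc = Jsoup.parse(html);
    System.out.println(doc.html());
}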
and a.html is:
<html><head></head><body><a href="www.google.com">Link<p>Error link</a></body></html>
and the result is what you wanted:
<html><head></head> <body> <a href="www.google.com">Link</a><p><a href="www.google.com">Error link</a> </p></body></html>