
How to save a Jsoup Document to an HTML file?

I have used this method to retrieve a webpage into an org.jsoup.nodes.Document object:

myDoc = Jsoup.connect(myURL).ignoreContentType(true).get();

How should I write this object to an HTML file? The methods myDoc.html(), myDoc.text() and myDoc.toString() don't output all elements of the document.

Some information inside a JavaScript element can be lost during parsing, for example the "timestamp" value in the source of an Instagram media page.

Ali Khezeli asked Jul 11 '14 11:07


1 Answer

Use doc.outerHtml().

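import java.io.File;
import java.nio.charset.StandardCharsets;

import org.apache.commons.io.FileUtils;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public void downloadPage() throws Exception {
    // Fetch the page and parse the response into a Document.
    final Response response = Jsoup.connect("http://www.example.net").execute();
    final Document doc = response.parse();

    // Write the document's full outer HTML to disk as UTF-8.
    final File f = new File("filename.html");
    FileUtils.writeStringToFile(f, doc.outerHtml(), StandardCharsets.UTF_8);
}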

Don't forget to handle exceptions. Add the Apache Commons IO (commons-io) library as a dependency, or download it, for a quick and easy way to save files in UTF-8.
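If adding commons-io is not an option, the same write can probably be done with the JDK alone. The following is a minimal sketch, not the answer's original method: HtmlSaver and saveHtml are hypothetical names, and the HTML string in main is a stand-in for what doc.outerHtml() would return. To keep the block free of external dependencies, the jsoup calls appear only in comments.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class, introduced here for illustration only.
final class HtmlSaver {

    // Writes an HTML string to the target file as UTF-8.
    // With jsoup, the string would typically come from doc.outerHtml().
    // Assumption: if the goal is the markup exactly as the server sent it
    // (before jsoup normalizes it), response.body() could be passed instead.
    static Path saveHtml(String html, Path target) throws IOException {
        // Files.write creates the file, or truncates an existing one, by default.
        return Files.write(target, html.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for doc.outerHtml(); no network access is needed here.
        String html = "<html><head></head><body><p>Hello</p></body></html>";
        Path out = saveHtml(html, Paths.get("filename.html"));
        System.out.println("Wrote " + Files.size(out) + " bytes to " + out);
    }
}
```

In the answer's code, the call would be HtmlSaver.saveHtml(doc.outerHtml(), Paths.get("filename.html")) in place of the FileUtils line.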

Gondy answered Sep 23 '22 02:09