 

Simplest way to correctly load HTML from a web page into a String in Java

Tags: java, html, parsing

Just what the title says.

Help greatly appreciated!

Mark asked Sep 04 '09 21:09





2 Answers

An extremely common error is failing to convert an HTTP response from bytes to characters correctly. To do this, you have to know the character encoding of the response. Ideally, it is specified as the "charset" parameter of the "Content-Type" header. But it can also be declared in the body itself, as an "http-equiv" attribute in a meta tag.

So, it is surprisingly complicated to load a page into a String correctly, and even third-party libraries like HttpClient don't offer a general solution.

Here's a simple implementation that will handle the most common case:

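    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.net.URL;
    import java.net.URLConnection;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    URL url = new URL("http://stackoverflow.com/questions/1381617");
    URLConnection con = url.openConnection();
    // \\s* also accepts headers with no space after the semicolon.
    Pattern p = Pattern.compile("text/html;\\s*charset=([^\\s]+)\\s*");
    // getContentType() returns null when the header is absent.
    String contentType = con.getContentType();
    Matcher m = p.matcher(contentType == null ? "" : contentType);
    /* If Content-Type doesn't match this pre-conception, choose default and
     * hope for the best. */
    String charset = m.matches() ? m.group(1) : "ISO-8859-1";
    StringBuilder buf = new StringBuilder();
    // try-with-resources closes the reader and the underlying stream.
    try (Reader r = new InputStreamReader(con.getInputStream(), charset)) {
      char[] chunk = new char[8192];
      int n;
      while ((n = r.read(chunk)) >= 0) {
        buf.append(chunk, 0, n);
      }
    }
    String str = buf.toString();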
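The implementation above only looks at the Content-Type header. As a rough sketch of the other case mentioned earlier, where the encoding is declared in a meta tag, something like the following could serve as a fallback. The class name `CharsetSniffer`, the `detect` method, and the 1024-byte scan window are assumptions made for illustration, not part of any library, and this is far simpler than what browsers actually do.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the JDK or any library.
final class CharsetSniffer {
    // charset parameter of a Content-Type header value; quotes are optional.
    private static final Pattern HEADER = Pattern.compile(
            "charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);
    // Covers <meta charset="x"> and the http-equiv form, whose content
    // attribute contains "charset=x".
    private static final Pattern META = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([^\\s\"'/>;]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, preferring the header over the body.
    static String detect(String contentType, byte[] body) {
        if (contentType != null) {
            Matcher h = HEADER.matcher(contentType);
            if (h.find()) {
                return h.group(1);
            }
        }
        // ISO-8859-1 maps every byte to one char, so ASCII markup such as
        // the meta tag survives this provisional decoding.
        // Assumption: the declaration sits within the first 1024 bytes.
        int len = Math.min(body.length, 1024);
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META.matcher(head);
        // Same "hope for the best" default as the answer's code.
        return m.find() ? m.group(1) : "ISO-8859-1";
    }
}
```

To use it, read the response into a byte array first, call `CharsetSniffer.detect(con.getContentType(), bytes)`, and then decode the same bytes with the returned charset name.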
erickson answered Oct 04 '22 12:10



You can still simplify it a bit using org.apache.commons.io.IOUtils:

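    import java.net.URL;
    import java.net.URLConnection;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.commons.io.IOUtils;

    URL url = new URL("http://stackoverflow.com/questions/1381617");
    URLConnection con = url.openConnection();
    Pattern p = Pattern.compile("text/html;\\s*charset=([^\\s]+)\\s*");
    // getContentType() returns null when the header is absent.
    String contentType = con.getContentType();
    Matcher m = p.matcher(contentType == null ? "" : contentType);
    /* If Content-Type doesn't match this pre-conception, choose default and
     * hope for the best. */
    String charset = m.matches() ? m.group(1) : "ISO-8859-1";
    String str = IOUtils.toString(con.getInputStream(), charset);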
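If adding Commons IO is not an option, the same last step can probably be done with the JDK alone on Java 9 or later, where `InputStream.readAllBytes()` is available. This is a minimal sketch; the class name `StreamToString` and its `read` method are hypothetical names chosen for illustration. Unlike `IOUtils.toString`, this version closes the stream when it is done.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

// Hypothetical helper for illustration; requires Java 9+ for readAllBytes().
final class StreamToString {
    // Reads the whole stream into memory and decodes it with the named charset.
    // Charset.forName throws an unchecked exception for unknown names.
    static String read(InputStream in, String charsetName) throws IOException {
        try (InputStream s = in) {
            return new String(s.readAllBytes(), Charset.forName(charsetName));
        }
    }
}
```

With the variables from the snippet above, the final line would become `String str = StreamToString.read(con.getInputStream(), charset);`.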
altumano answered Oct 04 '22 11:10
