
Extract HTML from URL

I'm using Boilerpipe to extract text from a URL, using this code:

URL url = new URL("http://www.example.com/some-location/index.html");
String text = ArticleExtractor.INSTANCE.getText(url);

The String text contains just the text of the HTML page, but I need the whole HTML code from it.

Has anyone used this library and knows how to extract the HTML code?

You can check the demo page for more info on the library.

Wassim AZIRAR asked Mar 06 '11 21:03

3 Answers

For something as simple as this you don't really need an external library:

 URL url = new URL("http://www.google.com");
 StringBuilder sb = new StringBuilder();
 try (BufferedReader br = new BufferedReader(
         new InputStreamReader(url.openStream()))) {
     String line;
     while ((line = br.readLine()) != null) {
         sb.append(line).append('\n'); // readLine() strips the line break
     }
 }
 String htmlContent = sb.toString();
Goran Jovic answered Sep 21 '22 12:09


Just use the KeepEverythingExtractor instead of the ArticleExtractor.

But this is using the wrong tool for the job. What you want is just to download the HTML content of a URL (right?), not to extract content. So why use a content extractor?
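If plain downloading is all that's needed, here is a minimal sketch using only the standard library. The names `FetchHtml`, `fetch`, and `charsetOf` are mine, not from any of the answers; the point is that a raw download can also honor the charset declared in the response headers, which the simpler readers above ignore:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FetchHtml {

    // Parse the charset out of a Content-Type header such as
    // "text/html; charset=UTF-8"; fall back to UTF-8 when absent.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String t = part.trim();
                if (t.toLowerCase().startsWith("charset=")) {
                    return Charset.forName(t.substring("charset=".length()).trim());
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Download the raw HTML of a URL, decoded with the server-declared charset.
    public static String fetch(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        Charset cs = charsetOf(conn.getContentType());
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(conn.getInputStream(), cs)) {
            char[] buf = new char[8192];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }
}
```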

Konrad Rudolph answered Sep 20 '22 12:09


With Java 7 and a Scanner trick, you can do the following:

public static String toHtmlString(URL url) throws IOException {
    Objects.requireNonNull(url, "The url cannot be null.");
    try (InputStream is = url.openStream(); Scanner sc = new Scanner(is)) {
        sc.useDelimiter("\\A");
        if (sc.hasNext()) {
            return sc.next();
        } else {
            return null; // or empty
        }
    }
}
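The "\\A" delimiter is what makes this work: \A matches only the beginning of input, so the Scanner returns the entire stream as a single token. A self-contained sketch of the same trick on an in-memory stream (the class name ScannerTrick is my own, chosen for illustration):

```java
import java.io.InputStream;
import java.util.Scanner;

public class ScannerTrick {

    // Read an entire InputStream as one token: "\\A" matches only the
    // start of input, so no delimiter ever splits the content.
    static String readAll(InputStream is) {
        try (Scanner sc = new Scanner(is, "UTF-8")) {
            sc.useDelimiter("\\A");
            return sc.hasNext() ? sc.next() : "";
        }
    }
}
```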
Paul Vargas answered Sep 18 '22 12:09