
Jsoup does not download the complete HTML of large webpages. Are there any alternatives or workarounds?

I am fetching HTML pages and parsing information from them, and I found that Jsoup does not download some pages completely. Fetching the same pages with curl on the command line returns the complete page. At first I thought the problem was site-specific, but when I tried other large webpages at random, Jsoup truncated those as well. Setting a user agent and a timeout did not help. Here is the code I tried:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URL;
import org.jsoup.Jsoup;

public class JsoupTest {
    public static void main(String[] args) throws MalformedURLException, UnsupportedEncodingException, IOException {
        String urlStr = "http://en.wikipedia.org/wiki/List_of_law_clerks_of_the_Supreme_Court_of_the_United_States";
        URL url = new URL(urlStr);
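        // Read the page with java.net.URL; keep line breaks so words on adjacent lines do not run together.
        StringBuilder builder = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"))) {
            for (String line; (line = reader.readLine()) != null;) {
                builder.append(line).append('\n');
            }
        }
        String content = builder.toString();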
        String article1 = Jsoup.connect(urlStr).get().text();
        String article2 = Jsoup.connect(urlStr).userAgent("Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6").referrer("http://www.google.com").timeout(30000).execute().parse().text();
        String article3 = Jsoup.parse(content).text();
        System.out.println("ARTICLE 1 : "+article1);
        System.out.println("ARTICLE 2 : "+article2);
        System.out.println("ARTICLE 3 : "+article3);
    }
}

For articles 1 and 2, where Jsoup connects to the website, I do not get the complete page. For article 3, where java.net.URL fetches the content and Jsoup only parses it, the page is complete. I have tried both Jsoup 1.8.1 and Jsoup 1.7.2.

Jay Dharmendra Solanki asked Jan 09 '23 01:01


1 Answer

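Jsoup limits how many bytes of the response body it reads (1 MB by default in the versions you tried) and silently truncates anything beyond that, which is why large pages come back incomplete. Raise the limit with the maxBodySize method: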

String article = Jsoup.connect(urlStr).maxBodySize(Integer.MAX_VALUE).get().text();
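
To see why a body size cap produces a page that is silently cut short rather than an error, here is a minimal sketch using only the standard library. It is a hypothetical illustration, not Jsoup's actual implementation: the class BodyCapDemo and the method readCapped are made-up names, and a byte array stands in for the HTTP response. It follows the convention in Jsoup's documentation that a limit of 0 means no limit, so Jsoup.connect(urlStr).maxBodySize(0) should also work in place of Integer.MAX_VALUE.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical illustration only: this is not Jsoup's implementation.
class BodyCapDemo {

    // Reads at most maxBytes from the stream; a maxBytes of 0 means "no limit".
    static byte[] readCapped(InputStream in, int maxBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try {
            while (true) {
                int want = buffer.length;
                if (maxBytes > 0) {
                    int remaining = maxBytes - out.size();
                    if (remaining <= 0) {
                        break; // cap reached: the rest of the body is dropped without an error
                    }
                    want = Math.min(want, remaining);
                }
                int n = in.read(buffer, 0, want);
                if (n < 0) {
                    break; // end of stream
                }
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] page = new byte[3 * 1024 * 1024]; // stand-in for a 3 MB page
        int capped = readCapped(new ByteArrayInputStream(page), 1024 * 1024).length;
        int unlimited = readCapped(new ByteArrayInputStream(page), 0).length;
        System.out.println("capped at 1 MB: " + capped + " bytes");
        System.out.println("no limit:       " + unlimited + " bytes");
    }
}
```

With an unlimited body size the whole page is held in memory, so for untrusted or very large sources a generous explicit limit may be the safer choice.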
DmitryKanunnikoff answered Jan 26 '23 07:01