
Web page source downloaded through Jsoup is not equal to the actual web page source

Tags:

java

html

url

jsoup

I have a serious problem here. I have searched all through Stack Overflow and many other sites. Everywhere they give the same solution, and I have tried all of them, but I am not able to resolve this issue.

I have the following code:

Document doc = Jsoup.connect(url).timeout(30000).get();

Here I am using the Jsoup library, and the result I get is not equal to the actual page source that we can see by right-clicking on the page -> page source. Many parts are missing from the result I get with the above line of code. After searching some sites on Google, I found this method:

URL url = new URL(webPage);
URLConnection urlConnection = url.openConnection();
urlConnection.setConnectTimeout(10000);
urlConnection.setReadTimeout(10000);
InputStream is = urlConnection.getInputStream();
InputStreamReader isr = new InputStreamReader(is);

int numCharsRead;
char[] charArray = new char[1024];
StringBuffer sb = new StringBuffer();
while ((numCharsRead = isr.read(charArray)) > 0) {
    sb.append(charArray, 0, numCharsRead);
}
String result = sb.toString();

System.out.println(result);

But no luck. While searching the internet for this problem, I saw many sites saying that I have to set the proper charset and encoding of the web page when downloading its source. But how do I find these out dynamically from my code? Are there any classes in Java for that? I also looked at crawler4j a bit, but it did not do much for me. Please help. I have been stuck on this problem for over a month and have tried everything I can, so my final hope is the gods of Stack Overflow, who have always helped!

Vasanth Nag K V Avatar asked Jan 11 '23 22:01



1 Answer

I had this recently: I'd run into some sort of robot protection. Sending a browser-like user agent fixed it for me, so change your original line to:

Document doc = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0")
                    .timeout(30000)
                    .get();
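
If you would rather stay with plain URLConnection, as in the second snippet from the question, here is a rough sketch of the same idea using only the JDK. It sends the same user agent and also reads the charset from the Content-Type response header, which is one way to pick the encoding dynamically. The names PageFetcher, charsetFrom and fetch are made up for illustration, and falling back to UTF-8 when the header names no charset is my assumption, not something the server guarantees.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper class, not part of Jsoup or the JDK.
public class PageFetcher {

    // Extracts the charset from a Content-Type value such as
    // "text/html; charset=ISO-8859-1". Falls back to UTF-8 (an assumption)
    // when the header is missing, has no charset, or names an unknown one.
    public static Charset charsetFrom(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length())
                                   .replace("\"", "")
                                   .trim();
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // illegal or unsupported name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Downloads a page with a browser-like User-Agent and decodes it
    // with the charset the server declared.
    public static String fetch(String pageUrl) throws IOException {
        URLConnection conn = URI.create(pageUrl).toURL().openConnection();
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        Charset charset = charsetFrom(conn.getContentType());

        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), charset))) {
            char[] buf = new char[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            // Pass the page URL as the first argument to try a real download.
            System.out.println(fetch(args[0]));
        } else {
            // No network needed: just show the header parsing.
            System.out.println(charsetFrom("text/html; charset=ISO-8859-1"));
        }
    }
}
```

This only covers the header; pages that declare their encoding in a meta tag instead would still need that handled separately. Jsoup does the charset handling for you, so with Jsoup the userAgent call above should be all you need.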
cftygv Avatar answered Feb 05 '23 11:02
