
JSoup UserAgent, how to set it right?

Tags:

jsoup

I'm trying to parse the front page of Facebook with jsoup, but I always get the HTML for mobile devices and not the version served to desktop browsers (in my case Firefox 5.0).

I'm setting my User Agent like this:

doc = Jsoup.connect(url)
      .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0")
      .get();

Am I doing something wrong?

EDIT:

I just parsed http://whatsmyuseragent.com/ and it looks like the user agent is working. Now it's even more confusing to me why http://www.facebook.com/ returns a different version to jsoup than to my browser, since both are using the same user agent.

I have now noticed this behaviour on some other sites too. If you could explain what the issue is, I would be more than happy.

Markus asked Jul 05 '11 11:07





2 Answers

You might try setting the referrer header as well:

doc = Jsoup.connect("https://www.facebook.com/")
      .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6")
      .referrer("http://www.google.com")
      .get();
Denaitre Roux answered Sep 23 '22 13:09



Connection.Response response = Jsoup.connect(location)
      .ignoreContentType(true)
      .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
      .referrer("http://www.google.com")
      .timeout(12000)
      .followRedirects(true)
      .execute();

Document doc = response.parse();

User Agent

Use a current user agent string. A full list is available at http://www.useragentstring.com/pages/useragentstring.php.

Timeout

Also set a timeout, since a page can sometimes take longer to download than the default timeout allows.

Referer

Set the referrer to Google.

Follow redirects

Follow redirects so that you end up on the final page.

execute() instead of get()

Use execute() to get the Response object, which lets you check the content type and the status code in case of an error.

Later you can parse the response object to obtain the document.
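
The points above can be put together in one small class. This is only a sketch, assuming jsoup is on the classpath: the class name BrowserLikeFetch, the helper names configure and fetch, the status and content-type checks, and the use of ignoreHttpErrors(true) are illustrative choices and not part of the original answer. It is also not a guarantee that a given site will serve its desktop page.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BrowserLikeFetch {
    // Example desktop user agent taken from the answer; swap in a current one.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0";

    /** Builds the request without sending it, so the settings can be inspected. */
    static Connection configure(String url) {
        return Jsoup.connect(url)
                .userAgent(USER_AGENT)
                .referrer("http://www.google.com")
                .timeout(12000)            // milliseconds
                .followRedirects(true)
                .ignoreContentType(true)   // do not throw on non-HTML content types
                .ignoreHttpErrors(true);   // do not throw on 4xx/5xx; the status is checked below
    }

    /** Sends the request and parses the body only if it is a successful HTML response. */
    static Document fetch(String url) throws IOException {
        Connection.Response response = configure(url).execute();

        int status = response.statusCode();
        if (status < 200 || status >= 300) {
            throw new IOException("Unexpected HTTP status " + status + " for " + response.url());
        }

        String contentType = response.contentType(); // may be null if the header is missing
        if (contentType == null || !contentType.toLowerCase().contains("html")) {
            throw new IOException("Not an HTML response: " + contentType);
        }

        return response.parse();
    }

    public static void main(String[] args) throws IOException {
        String url = args.length > 0 ? args[0] : "https://www.facebook.com/";
        Document doc = fetch(url);
        System.out.println(doc.title());
    }
}
```

Keeping configure separate from fetch makes it possible to look at conn.request() and confirm which headers and options will be sent before any network call is made.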

Sorter answered Sep 23 '22 13:09



