I'm trying to parse the front page of Facebook with jsoup, but I always get the HTML for mobile devices and not the version served to desktop browsers (in my case Firefox 5.0).
I'm setting my User Agent like this:
doc = Jsoup.connect(url)
    .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0")
    .get();

Am I doing something wrong?
EDIT:
I just parsed http://whatsmyuseragent.com/ and it looks like the user agent is being sent correctly. Now it's even more confusing to me why http://www.facebook.com/ returns a different version to jsoup than to my browser, since both use the same user agent.
I have now noticed this behaviour on some other sites too. If you could explain what the issue is, I would be more than happy.
You might try setting the referrer header as well:
doc = Jsoup.connect("https://www.facebook.com/")
    .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6")
    .referrer("http://www.google.com")
    .get();
Response response = Jsoup.connect(location)
    .ignoreContentType(true)
    .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
    .referrer("http://www.google.com")
    .timeout(12000)
    .followRedirects(true)
    .execute();
Document doc = response.parse();

User Agent
Use a recent user agent string. A complete list is available at http://www.useragentstring.com/pages/useragentstring.php.
Timeout
Also don't forget to set a timeout, since the page sometimes takes longer to download than the default timeout allows.
Referer
Set the referrer to Google.
Follow redirects
Follow redirects to reach the final page.
execute() instead of get()
Use execute() to get the Response object, which lets you check the content type and status code in case of an error.
You can then parse the response object to obtain the Document.
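As a rough illustration of the status and content-type check described above, here is a minimal sketch in plain Java. ResponseCheck and its methods are hypothetical names made up for this example and are not part of jsoup; the choice to treat only 2xx codes as success and only text/html or application/xhtml+xml as HTML is also an assumption.

```java
import java.util.Locale;

/** Hypothetical helper (not part of jsoup): decides whether a fetched response is worth parsing as HTML. */
final class ResponseCheck {
    private ResponseCheck() {}

    /** True for 2xx status codes (assumption: anything else is treated as a failure). */
    static boolean isSuccess(int statusCode) {
        return statusCode >= 200 && statusCode < 300;
    }

    /** True when the Content-Type value names an HTML or XHTML media type; parameters such as charset are ignored. */
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Keep only the media type in front of the first ';' and normalise case and whitespace.
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("text/html") || mediaType.equals("application/xhtml+xml");
    }

    /** Combines both checks. */
    static boolean isParsableHtml(int statusCode, String contentType) {
        return isSuccess(statusCode) && isHtml(contentType);
    }
}
```

With the Response from the snippet above, the check would be ResponseCheck.isParsableHtml(response.statusCode(), response.contentType()) before calling response.parse(). This matters mostly when ignoreContentType(true) is set, because jsoup then accepts non-HTML bodies. Note that jsoup throws an exception on HTTP error statuses unless ignoreHttpErrors(true) is also set, so the status check is only reached for error codes in that case.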