Getting Jsoup to support HTML dynamically generated by JavaScript

Right now I'm working on a web crawler. It should parse some specific sites and write the output to an XML file. Up to this point, that's no problem. The crawler works and you can customize it really quickly via a cfg file. I use Jsoup to parse the HTML content.

I just added a few more sites and noticed that I have a huge problem with HTML content that is created via JavaScript. Is there a way to make Jsoup support JavaScript? Or at least to get the full HTML content I can see in my browser?

I already tried HtmlUnit, but this one didn't do well. It did not give me the content I would get in my browser.

Sincerely,

Ogofo

Ogofo asked Sep 27 '12 15:09


People also ask

Does jsoup run JavaScript?

jsoup will not run JavaScript for you - if you need that in your app I'd recommend looking at JCEF.

How to parse HTML file using Java?

HTML parsing is very simple with Jsoup: all you need to do is call the static method Jsoup.parse() and pass your HTML String to it. Jsoup provides several overloaded parse() methods to read HTML from a String, a File, a URL, or an InputStream, some of which also take a base URI.
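
As a rough illustration of the String overload, here is a minimal sketch. The class name `JsoupParseExample`, the helper `firstLinkHref`, and the `https://example.com/` base URI are made up for this example; `selectFirst` assumes jsoup 1.11.1 or later.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupParseExample {

    // Hypothetical helper: parses an HTML string and returns the absolute
    // URL of the first link, or null if the document has no link.
    public static String firstLinkHref(String html, String baseUri) {
        // The base URI is only used to resolve relative links.
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.selectFirst("a[href]");
        // The "abs:" prefix asks Jsoup to resolve the attribute against the base URI.
        return link == null ? null : link.attr("abs:href");
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"/docs\">Docs</a></body></html>";
        System.out.println(firstLinkHref(html, "https://example.com/"));
    }
}
```

Note that this only parses the markup it is given; nothing in it executes scripts.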

How does jsoup work?

jsoup can parse HTML files, input streams, URLs, or even strings. It eases data extraction from HTML by offering Document Object Model (DOM) traversal methods and CSS and jQuery-like selectors. jsoup can manipulate the content: the HTML element itself, its attributes, or its text.


1 Answer

Jsoup does not support JavaScript and does not emulate a browser. Forget about it if you're planning to execute JavaScript. In my experience HtmlUnit, which is a headless browser, has given me the best results (speaking only of Java frameworks).

One thing that is worth trying in HtmlUnit is changing the BrowserVersion (Chrome / Internet Explorer / Firefox) when creating the WebClient instance. Some sites react differently to different browsers, and sometimes just changing that value gives you the results you expect.
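
To make the first point concrete, here is a small sketch (the class name `NoScriptExecutionDemo` and the sample page are invented for illustration). The page fills a `div` from a script; Jsoup keeps the script source as data but never runs it, so the `div` stays empty.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class NoScriptExecutionDemo {

    // Sample page (made up): a browser would fill #out with a paragraph.
    static final String HTML =
        "<html><body><div id=\"out\"></div>"
        + "<script>document.getElementById('out').innerHTML = '<p>generated</p>';</script>"
        + "</body></html>";

    // Text inside #out as Jsoup sees it, i.e. without any script execution.
    public static String textOfOut(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("#out").text();
    }

    // The script source is still available, but only as inert data.
    public static String scriptSource(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("script").first().data();
    }

    public static void main(String[] args) {
        System.out.println("#out text: '" + textOfOut(HTML) + "'");
        System.out.println("script:    " + scriptSource(HTML));
    }
}
```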
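
A sketch of what that could look like, handing the rendered DOM back to Jsoup so the rest of the crawler stays the same. Several things here are assumptions: the class name `RenderedPageFetcher`, the wait time, and the option settings are just one plausible setup; the imports assume HtmlUnit 2.x (`com.gargoylesoftware.htmlunit`; 3.x moved to `org.htmlunit`); and the available `BrowserVersion` constants differ between HtmlUnit releases.

```java
import java.io.IOException;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RenderedPageFetcher {

    // Loads the URL in HtmlUnit with the given simulated browser, gives
    // background scripts some time to run, and returns the resulting DOM
    // as a Jsoup Document.
    public static Document fetchRendered(String url, BrowserVersion browser, long jsWaitMillis)
            throws IOException {
        try (WebClient webClient = new WebClient(browser)) {
            // Many real-world pages contain scripts HtmlUnit cannot run cleanly;
            // do not abort the whole fetch because of them.
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            // CSS is not needed for scraping content.
            webClient.getOptions().setCssEnabled(false);

            HtmlPage page = webClient.getPage(url);
            // Wait (up to the given time) for timers and AJAX calls to finish.
            webClient.waitForBackgroundJavaScript(jsWaitMillis);

            // Serialize the DOM after script execution and hand it to Jsoup.
            return Jsoup.parse(page.asXml(), url);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.out.println("usage: RenderedPageFetcher <url>");
            return;
        }
        Document doc = fetchRendered(args[0], BrowserVersion.CHROME, 5_000);
        System.out.println(doc.title());
    }
}
```

If one browser version does not give you the content you see in your own browser, swapping the `BrowserVersion` argument is a one-line change to try the others.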

Mosty Mostacho answered Nov 01 '22 01:11