Right now I'm working on a web crawler. It should parse some specific sites and write the output to an XML file. Up to this point, that has been no problem: the crawler works and you can customize it really quickly via a cfg file. I use Jsoup to parse the HTML content.
I just added a few more sites and noticed that I have a huge problem with HTML content that is created via JavaScript. Is there a way to make Jsoup support JavaScript? Or at least to get the full HTML content I can see in my browser?
I already tried HtmlUnit, but this one didn't do well. It did not give me the content I would get in my browser.
Sincerely,
Ogofo
jsoup will not run JavaScript for you - if you need that in your app I'd recommend looking at JCEF.
HTML parsing is very simple with Jsoup: all you need to do is call the static method Jsoup.parse() and pass your HTML String to it. Jsoup provides several overloaded parse() methods that read HTML from a String, a File, an InputStream, or a URL, most of them optionally taking a base URI for resolving relative links.
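As a minimal sketch of the String variant, assuming jsoup is on the classpath (the class name ParseExample, the helper titleOf and the sample markup are made up for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseExample {
    // Parses an HTML string and returns the text of its <title> element.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    public static void main(String[] args) {
        // Hypothetical sample markup, not taken from any real site.
        String html = "<html><head><title>Sample page</title></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println(titleOf(html));
    }
}
```

Note that this only sees the markup it is given. Anything a script would add later in a browser is not part of the resulting Document.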
jsoup can parse HTML files, input streams, URLs, or even strings. It eases data extraction from HTML by offering Document Object Model (DOM) traversal methods and CSS and jQuery-like selectors. jsoup can manipulate the content: the HTML element itself, its attributes, or its text.
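To illustrate the selector and manipulation side, here is a small sketch, again assuming jsoup is on the classpath. The class SelectExample and both helper methods are hypothetical names, and adding rel="nofollow" is just an arbitrary example of changing an attribute:

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectExample {
    // Collects the href attribute of every <a> element that has one.
    static List<String> linkTargets(String html) {
        Document doc = Jsoup.parse(html);
        List<String> hrefs = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    // Sets rel="nofollow" on every link and returns the modified body markup.
    static String markLinks(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a[href]").attr("rel", "nofollow");
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/first\">One</a> <a href=\"/second\">Two</a></p>";
        System.out.println(linkTargets(html));
        System.out.println(markLinks(html));
    }
}
```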
Jsoup does not support JavaScript and does not emulate a browser, so forget about it if you're planning to execute JavaScript. In my experience HtmlUnit, which is a headless browser, has given me the best results (speaking only of Java frameworks).
One thing worth trying in HtmlUnit is changing the BrowserVersion (Chrome / Internet Explorer / Firefox) when creating the WebClient instance. Some sites react differently depending on it, and sometimes just changing that value gives you the results you expect.
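A rough sketch of that idea follows. It assumes HtmlUnit 3.x or later, where the classes live in the org.htmlunit package (the 2.x releases use com.gargoylesoftware.htmlunit instead). The class RenderedFetch, the method fetchRendered, the two-second wait and the example URL are all placeholders, not values from the question:

```java
import java.io.IOException;

import org.htmlunit.BrowserVersion;
import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlPage;

public class RenderedFetch {
    // Loads a page with the given emulated browser, lets scripts run,
    // and returns the resulting DOM serialized as markup.
    static String fetchRendered(String url, BrowserVersion version) throws IOException {
        try (WebClient client = new WebClient(version)) {
            client.getOptions().setJavaScriptEnabled(true);
            // Many real pages contain script errors; do not abort on them.
            client.getOptions().setThrowExceptionOnScriptError(false);
            client.getOptions().setCssEnabled(false);

            HtmlPage page = client.getPage(url);
            // Give asynchronous scripts some time to finish (assumed value).
            client.waitForBackgroundJavaScript(2000);
            return page.asXml();
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; try another BrowserVersion if the result looks incomplete.
        String url = "https://example.com/";
        System.out.println(fetchRendered(url, BrowserVersion.CHROME).length());
        System.out.println(fetchRendered(url, BrowserVersion.FIREFOX).length());
    }
}
```

The returned string can then be handed to Jsoup.parse() so the rest of the crawler keeps using Jsoup selectors. Whether the output matches what a real browser shows still depends on the site.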