I am a beginner at crawling. I need to fetch the posts and comments from a link, and I want to automate this process. I considered using a web crawler and jsoup for this, but was told that web crawlers are mostly used for websites with greater depth.
Sample page: a Jive community website
For this page, when I view the source, I can see only the post and not the comments. I think this is because the comments are fetched through an AJAX call to the server.
Hence, when I use jsoup, it doesn't fetch the comments.
So how can I automate the process of fetching posts and comments?
Fetch is a JavaScript interface for making AJAX calls. It is widely implemented in modern browsers and is used to call APIs. Calling fetch() returns a promise that resolves to a Response object.
jQuery's ajax() (and its various shortcut methods) returns a jqXHR object, which is a superset of the browser's native XMLHttpRequest object and implements, among other things, the Promise interface. See the jQuery documentation on the jqXHR object for more details.
Fetch is supported by all recent browsers, including Edge, but not by Internet Explorer. Therefore, if you are looking for maximum compatibility, you may still want to use Ajax (XMLHttpRequest) to update a web page. If you need a persistent, two-way connection to the server, a WebSocket is also more appropriate than fetch.
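Since the goal here is to automate things from Java rather than from a browser, one option (not mentioned in the answers above, just a hedged sketch) is to find the URL that the page's AJAX call hits, which you can see in the browser's developer tools under the network tab, and request it directly. The endpoint below is purely hypothetical; the real one depends on the site:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CommentFetcher {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint: replace with the URL the page's AJAX call
        // actually uses (found via the browser's network tab).
        String commentsUrl = "https://community.example.com/api/threads/12345/comments";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(commentsUrl))
                .header("Accept", "application/json")
                .GET()
                .build();

        // The response is typically JSON, so you would parse it with a JSON
        // library rather than with jsoup.
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```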
Jsoup is an HTML parser only. Unfortunately, it cannot parse content that is loaded via JavaScript/AJAX, since jsoup does not execute scripts.
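To illustrate: a plain jsoup fetch like the sketch below only retrieves the HTML the server initially sends, so anything injected later by a script is simply not there (the comment selector is a made-up placeholder):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class StaticFetch {
    public static void main(String[] args) throws Exception {
        // jsoup downloads and parses the initial HTML only; no scripts run.
        Document doc = Jsoup.connect("https://community.example.com/thread/12345").get();

        System.out.println("Post title: " + doc.title());

        // Hypothetical selector; whatever the real one is, elements added
        // later by JavaScript will not be present in this document.
        Elements comments = doc.select("div.comment");
        System.out.println("Comments found: " + comments.size()); // likely 0
    }
}
```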
The solution: use a library that can execute scripts.
Here are some examples I know of:
If such a library doesn't support parsing or selectors, you can at least use it to obtain the HTML produced by the scripts, which can then be parsed by jsoup.
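As one concrete sketch of that approach (HtmlUnit is my own pick of a script-capable library, not necessarily one of the examples the answer had in mind, and the class names below are from its classic com.gargoylesoftware package, which may differ in newer releases), you can let it render the page and hand the resulting HTML to jsoup:

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class RenderedFetch {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // Let HtmlUnit run the page's JavaScript, including the AJAX
            // call that loads the comments.
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            HtmlPage page = webClient.getPage("https://community.example.com/thread/12345");
            // Give background AJAX requests some time to finish.
            webClient.waitForBackgroundJavaScript(5_000);

            // Take the rendered DOM as HTML and parse it with jsoup as usual.
            Document doc = Jsoup.parse(page.asXml());
            Elements comments = doc.select("div.comment"); // hypothetical selector
            comments.forEach(c -> System.out.println(c.text()));
        }
    }
}
```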