Retrieve comments from a website using Disqus

Tags:

disqus

I would like to write a scraping script to retrieve comments from CNN articles. For example, this article: http://www.cnn.com/2012/01/19/politics/gop-debate/index.html?hpt=hp_t1

I realize that CNN uses Disqus for its comment discussions. Because the comments are not split across separate web pages (i.e., there are no prev page / next page links) but are loaded dynamically (i.e., you need to click "load next 25"), I have no idea how to retrieve all 5000+ comments for this article.

Any ideas or suggestions?

Thanks so much!

asked Jan 20 '12 06:01

qwertyl

2 Answers

I needed to scrape comments from a page that loaded its Disqus comments via AJAX. Because they were not rendered on the server, I had to call the Disqus API directly. From the page source, you will need the thread identifier:

var identifier = "456643"; // take note of this from the page source
// this is the value of the thread:ident URL query parameter in the request below

Also look in the page's JavaScript source for the public key and the forum name, and place them in the URL where appropriate.
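If you want to grab the identifier programmatically rather than by eye, a regex over the raw HTML is usually enough. This is only a sketch: it assumes the page declares the value as `var identifier = "..."` like the line above, and `extractIdentifier` and `sampleHtml` are names I made up for illustration. Other sites may use a different variable name, so adjust the pattern to what you actually see in the source.

```javascript
// Sketch: pull the thread identifier out of raw page HTML.
// Assumption: the page declares it as `var identifier = "...";`.
function extractIdentifier(html) {
  const match = /var\s+identifier\s*=\s*["']([^"']+)["']/.exec(html);
  return match ? match[1] : null; // null when the pattern is not present
}

// Hypothetical fragment of page source, for demonstration only.
const sampleHtml = '<script>var identifier = "456643" // thread id</script>';
console.log(extractIdentifier(sampleHtml)); // prints 456643
```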

I tested this with Node.js, i.e.:

// Uses the third-party "request" package (npm install request).
var request = require("request");

var publicKey  = "pILMw27bsbJsdfsdQDh9Eh0MzAgFL6xx0hYdsdsdfaIfBHRvLGqFFQ09st";

var disqusUri = "https://disqus.com/api/3.0/threads/listPosts.json?api_key=" + publicKey + "&thread:ident=456643&forum=nameOfForumFromSource";

// request calls back with (error, response, body)
request(disqusUri, function (err, res, body) {
    if (err) {
        console.log("ERR: " + err);
        return;
    }
    console.log(body);
});
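A single call like the one above only returns one page of posts, so for a thread with thousands of comments you would have to keep asking for the next page. Here is a rough sketch of how that loop could look. It assumes the JSON body carries a `response` array plus a `cursor` object with `hasNext` and `next`, and that the endpoint accepts `cursor` and `limit` query parameters; that is my understanding of the Disqus API, so verify it against the current docs. `buildListPostsUrl`, `fetchAllPosts` and the `fetchPage` callback are hypothetical names, not part of any library.

```javascript
// Build a listPosts URL for one page of results.
// Assumption: `cursor` and `limit` are accepted by the endpoint.
function buildListPostsUrl({ apiKey, forum, ident, cursor, limit = 100 }) {
  let url = "https://disqus.com/api/3.0/threads/listPosts.json" +
    "?api_key=" + encodeURIComponent(apiKey) +
    "&forum=" + encodeURIComponent(forum) +
    "&thread:ident=" + encodeURIComponent(ident) +
    "&limit=" + limit;
  if (cursor) url += "&cursor=" + encodeURIComponent(cursor);
  return url;
}

// Follow cursors until the API reports that there is no further page.
// `fetchPage(cursor)` is a hypothetical callback that resolves to the
// parsed JSON body for one page (cursor is null for the first page).
async function fetchAllPosts(fetchPage, maxPages = 1000) {
  const posts = [];
  let cursor = null;
  for (let page = 0; page < maxPages; page++) {
    const body = await fetchPage(cursor);
    posts.push(...body.response);
    if (!body.cursor || !body.cursor.hasNext) break;
    cursor = body.cursor.next;
  }
  return posts;
}

// Possible usage on Node 18+, which ships a global fetch (not run here):
// fetchAllPosts((cursor) =>
//   fetch(buildListPostsUrl({ apiKey: publicKey, forum: "nameOfForumFromSource",
//                             ident: "456643", cursor }))
//     .then((r) => r.json()));
```

The `maxPages` cap is just a safety net so a misbehaving cursor cannot loop forever. Keep an eye on the API's rate limits when pulling a large thread.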

answered Sep 21 '22 01:09

med116

Another option for scraping (other than just fetching the page), which might be less robust (depending on your needs) but will solve the problem you have, is to drive a full-fledged web browser through some kind of wrapper, literally code the usage pattern, and extract the relevant data. Since you didn't mention which programming language you know, I'll give 3 examples: 1) Watir (Ruby), 2) WatiN (IE & Firefox via .NET), 3) Selenium (IE and other browsers via C#/Java/Perl/PHP/Ruby/Python)

I'll provide a little example using WatiN & C#:

using WatiN.Core;

IE browser = new IE();
browser.GoTo("YOUR CNN URL");
List visibleComments = browser.List(Find.ById("dsq-comments"));
// do your scraping thing
Link moreComments = browser.Link(Find.ByClass("dsq-paginate-append-text"));
moreComments.Click();
// wait until the AJAX call has ended by searching for some indicator
browser.WaitUntilContainsText("SOME TEXT");
// do your scraping thing

Note: I'm not familiar with Disqus, but it might be a better option to force all the comments to show by looping over the Link and Click parts of the code I posted until all the comments are visible, and then scrape the dsq-comments List element.
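The loop itself can be sketched independently of the browser tool. The version below is in JavaScript and deliberately driver-agnostic: `driver` is a hypothetical adapter you would write around WatiN, Watir, Selenium or similar, and `findMoreLink`, `click` and `countComments` are made-up method names, not real API calls of any of those tools. It clicks the "load more" link until it disappears, waiting after each click until the number of rendered comments has grown.

```javascript
// Driver-agnostic sketch of "click until everything is loaded".
// `driver` is a hypothetical adapter exposing three async methods:
//   findMoreLink()  -> the "load more" element, or null when it is gone
//   click(element)  -> clicks that element
//   countComments() -> number of comments currently rendered
async function loadAllComments(driver, { maxClicks = 500, pollMs = 250, timeoutMs = 10000 } = {}) {
  for (let i = 0; i < maxClicks; i++) {
    const link = await driver.findMoreLink();
    if (!link) break; // nothing left to load
    const before = await driver.countComments();
    await driver.click(link);
    // Poll until the AJAX call has added comments, or give up after timeoutMs.
    const deadline = Date.now() + timeoutMs;
    while ((await driver.countComments()) <= before) {
      if (Date.now() > deadline) return driver.countComments();
      await new Promise((resolve) => setTimeout(resolve, pollMs));
    }
  }
  return driver.countComments(); // total comments visible when the loop ends
}
```

Waiting on the comment count rather than on a fixed sleep keeps the loop from scraping half-loaded pages, and the `maxClicks` and `timeoutMs` limits stop it from hanging if the link never goes away. Once it returns, scrape the dsq-comments element in one pass.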

answered Sep 22 '22 01:09

Boaz
