
Retrieve comments from a website using Disqus

I would like to write a scraping script to retrieve comments from CNN articles, for example this one: http://www.cnn.com/2012/01/19/politics/gop-debate/index.html?hpt=hp_t1

I see that CNN uses Disqus for its comment discussions. Because the comments are not paginated (i.e., there are no previous/next page links) but are loaded dynamically (i.e., you need to click "load next 25"), I have no idea how to retrieve all 5000+ comments for this article.

Any idea or suggestion?

Thanks so much!

qwertyl asked Jan 20 '12 06:01



2 Answers

I needed to scrape comments from a page that loaded its Disqus comments via AJAX. Because they were not rendered on the server, I had to call the Disqus API. You will need the thread identifier, which appears in the page source:

var identifier = "456643"; // take note of this from the page source
// this is the thread:ident URL query param in the following JS request

Also, look in the JS source code for the page's public key and forum name, and place them in the URL where appropriate.

I used Node.js to test this, i.e.:

var request = require("request");

var publicKey  = "pILMw27bsbJsdfsdQDh9Eh0MzAgFL6xx0hYdsdsdfaIfBHRvLGqFFQ09st";
var identifier = "456643"; // thread identifier from the page source
var forum      = "nameOfForumFromSource";

var disqusUri = "https://disqus.com/api/3.0/threads/listPosts.json?api_key=" + publicKey +
    "&thread:ident=" + identifier + "&forum=" + forum;

// request's callback signature is (error, response, body)
request(disqusUri, function (err, res, body) {
    if (err) {
        console.log("ERR: " + err);
        return;
    }
    console.log(body);
});
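
The call above returns only one page of posts, while the question asks for all 5000+ comments. Here is a minimal sketch of how one might page through the rest. It assumes that the listPosts response carries a `cursor` object with `hasNext` and `next` fields and that the endpoint accepts `cursor` and `limit` query parameters; check the Disqus API documentation for the exact shapes and limits. The names `buildListPostsUrl`, `fetchAllPosts` and `fetchJson` are hypothetical, introduced only for this illustration, and the default fetcher relies on the global `fetch` of Node 18 or later instead of the `request` module.

```javascript
// Sketch only: cursor-based paging over threads/listPosts.
// Assumption: each response looks like { response: [...], cursor: { hasNext, next } }.
const API_URL = "https://disqus.com/api/3.0/threads/listPosts.json";

// Build the URL for one page of posts (hypothetical helper).
function buildListPostsUrl({ publicKey, forum, identifier, cursor, limit = 100 }) {
  const parts = [
    "api_key=" + encodeURIComponent(publicKey),
    "forum=" + encodeURIComponent(forum),
    "thread:ident=" + encodeURIComponent(identifier),
    "limit=" + limit, // assumed page size; the API may cap it
  ];
  if (cursor) parts.push("cursor=" + encodeURIComponent(cursor));
  return API_URL + "?" + parts.join("&");
}

// Default fetcher: uses the global fetch available in Node 18+.
async function defaultFetchJson(url) {
  const res = await fetch(url);
  if (!res.ok) throw new Error("HTTP " + res.status);
  return res.json();
}

// Follow cursors until the API reports no further page, or maxPages is reached.
async function fetchAllPosts(options, fetchJson = defaultFetchJson, maxPages = 100) {
  const posts = [];
  let cursor = null;
  for (let page = 0; page < maxPages; page++) {
    const data = await fetchJson(buildListPostsUrl({ ...options, cursor }));
    posts.push(...data.response);
    if (!data.cursor || !data.cursor.hasNext) break;
    cursor = data.cursor.next;
  }
  return posts;
}
```

With the values from the page source, `fetchAllPosts({ publicKey: publicKey, forum: "nameOfForumFromSource", identifier: "456643" })` would resolve to a single array of posts. The `maxPages` cap is only a safety net against an endless loop if the cursor never runs out.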
med116 answered Sep 21 '22 01:09


Another option for scraping (other than fetching the page directly), which might be less robust (depending on your needs) but will solve the problem you have, is to use a wrapper around a full-fledged web browser, literally code the usage pattern, and extract the relevant data. Since you didn't mention which programming language you know, I'll give 3 examples: 1) Watir (Ruby), 2) WatiN (IE and Firefox, via .NET), 3) Selenium (IE and other browsers, via C#/Java/Perl/PHP/Ruby/Python).

Here is a little example using WatiN and C#:

using WatiN.Core;

// Placeholders: substitute the article URL and a text marker that appears once loading has finished.
IE browser = new IE();
browser.GoTo("YOUR CNN URL");
List visibleComments = browser.List(Find.ById("dsq-comments"));
// do your scraping thing
Link moreComments = browser.Link(Find.ByClass("dsq-paginate-append-text"));
moreComments.Click();
// wait until the AJAX call has ended by searching for some indicator
browser.WaitUntilContainsText("SOME TEXT");
// do your scraping thing

Note: I'm not familiar with Disqus, but it might be a better option to force all the comments to show by looping over the Link and click parts of the code I posted until all the comments are visible, and then scrape the List element dsq-comments.
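
The looping idea in the note above can be sketched independently of any particular browser library. This is an illustration in JavaScript, not WatiN code: `loadAllComments` and the `page` adapter (with `findMoreLink` and `countComments`) are hypothetical names that you would implement on top of whichever driver you use. The sketch assumes that each click on the "load more" link eventually increases the number of rendered comments, and it uses that growth as the indicator that the AJAX call has finished.

```javascript
// Sketch only: click "load more" until the link disappears.
// `page` is a hypothetical adapter around a real browser driver:
//   page.findMoreLink()  -> resolves to an object with click(), or null when the link is gone
//   page.countComments() -> resolves to the number of comments currently rendered
async function loadAllComments(page, { maxClicks = 500, pollMs = 250, timeoutMs = 10000 } = {}) {
  let clicks = 0;
  while (clicks < maxClicks) {
    const link = await page.findMoreLink();
    if (!link) break; // nothing left to load

    const before = await page.countComments();
    await link.click();
    clicks++;

    // Wait for the AJAX call to finish: the comment count should grow.
    const deadline = Date.now() + timeoutMs;
    while ((await page.countComments()) <= before) {
      if (Date.now() > deadline) return clicks; // give up if the page stops responding
      await new Promise((resolve) => setTimeout(resolve, pollMs));
    }
  }
  return clicks; // number of times the link was clicked
}
```

Once the function resolves, every comment should be in the DOM, so the dsq-comments element can be scraped in a single pass. The `maxClicks` and `timeoutMs` limits are arbitrary defaults that keep the loop from running forever.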

Boaz answered Sep 22 '22 01:09