
Live chat scraping (YouTube) with CasperJS: issue with selecting Polymer elements

I am trying to scrape the text from YouTube live chat feeds using CasperJS, but I am having trouble finding the correct selector. There are many nested elements, and new elements are generated dynamically for each message that gets pushed out. How might one go about continually pulling the nested

<span id="message">some message</span>

as they occur? Currently I can't grab even one! Here's my test code. Note: you can substitute any YouTube URL that has a live chat feed.

const casper = require("casper").create({
  viewportSize: {
    width: 1080,
    height: 724
  }
});
const ua = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'
const url = "https://www.youtube.com/watch?v=NksKCLsMUsI";
casper.start();
casper.userAgent(ua)
casper.thenOpen(url, function() {
  this.wait(3000, function() {
    if (this.exists("span#message")) {
      this.echo("found the a message!");
    } else {
      this.echo("can't find a message");
    }
    casper.capture("test.png");
  });
});

casper.run();

My question is exactly this: 1. How do I properly select the messages? 2. How might I continually listen for new ones?

UPDATE: I have been playing with Nightmare (Electron testing suite), and it looks promising, but I still can't seem to select the chat elements. I know I'm missing something simple.

EDIT / UPDATE (using cadabra's fine example)

var casper = require("casper").create({
  viewportSize: {
    width: 1024,
    height: 768
  }
});

// Declared with var to avoid creating implicit globals
var url = 'https://www.youtube.com/live_chat?continuation=0ofMyAMkGiBDZzhLRFFvTFJVRTFVVlkwZEV4MFRFVWdBUSUzRCUzRDAB';
var ua = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0';

casper.start(url)
casper.userAgent(ua);

var currentMessage = '';

(function getPosts() {
  var post = null;

  casper.wait(1000, function () {
    casper.capture('test.png')
    post = this.evaluate(function () {
      var nodes = document.querySelectorAll('yt-live-chat-text-message-renderer'),
          author = nodes[nodes.length - 1].querySelector('#author-name').textContent,
          message = nodes[nodes.length - 1].querySelector('#message').textContent;

      return {
        author: author,
        message: message
      };
    });
  });

  casper.then(function () {
    if (currentMessage !== post.message) {
      currentMessage = post.message;
      this.echo(post.author + ' - ' + post.message);
    }
  });

  casper.then(function () {
    getPosts();
  });
})();

casper.run();
archae0pteryx asked May 24 '17 20:05


1 Answer

This is much harder than you think... Here is what I tried, without success:

1. Use ignore-ssl-errors option

YouTube is served over HTTPS. This is a real problem for us because PhantomJS frequently fails on SSL/TLS handshakes or certificate checks... Here we need to use ignore-ssl-errors. The option can be passed on the command line:

casperjs --ignore-ssl-errors=true script.js

2. Access the chat page instead of the iframe

The comments we are trying to scrape are not in the main page. They come from a separate page that is loaded in an iframe. In CasperJS, we could use the withFrame() method, but that is needless complexity for something we can access directly...

Main page: the /watch URL. Chat page: the /live_chat URL used in the scripts in this answer.
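To illustrate going to the chat page directly, here is a minimal sketch that derives a chat URL from a watch URL. Note the assumptions: the scripts in this answer use a `continuation` token, whereas this sketch assumes the chat page also accepts the video ID through a `v` query parameter. The helper name `chatUrlFromWatchUrl` is hypothetical.

```javascript
// Hypothetical helper: derive a standalone chat URL from a watch URL.
// Assumption: the chat page can be addressed as /live_chat?v=<videoId>
// (the scripts in this thread use a continuation token instead).
function chatUrlFromWatchUrl(watchUrl) {
  // A plain regex is used so the helper does not depend on the URL class
  var match = /[?&]v=([^&#]+)/.exec(watchUrl);
  if (!match) {
    throw new Error('No "v" parameter found in ' + watchUrl);
  }
  return 'https://www.youtube.com/live_chat?v=' + match[1];
}

console.log(chatUrlFromWatchUrl('https://www.youtube.com/watch?v=NksKCLsMUsI'));
```

If the derived URL does not load the chat for a given stream, fall back to copying the iframe's src from the main page, as done in the scripts here.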

3. Test with PhantomJS (WebKit) and SlimerJS (Gecko)

Both engines give the same result, because YouTube's chat page treats them as outdated browsers:

Oh no!
It looks like you're using an older version of your browser. Please update it to use live chat.

If you want to test yourself, here is the script:

var casper = require("casper").create({
  viewportSize: {
    width: 1080,
    height: 724
  }
});

casper.start('https://www.youtube.com/live_chat?continuation=0ofMyAMkGiBDZzhLRFFvTFRtdHpTME5NYzAxVmMwa2dBUSUzRCUzRDAB');

casper.wait(5000, function () {
  this.capture('chat.png');
});

casper.run();

PhantomJS: casperjs --ignore-ssl-errors=true script.js

SlimerJS: casperjs --engine=slimerjs script.js

Conclusion: You may need to use a real web browser like Firefox or Chromium to achieve this. An automation framework like Nightwatch.js could help...


EDIT 1

OK, so... with your user-agent string set, this works:

var casper = require("casper").create({
  viewportSize: {
    width: 1080,
    height: 724
  }
});

casper.userAgent('Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0');

casper.start('https://www.youtube.com/live_chat?continuation=0ofMyAMkGiBDZzhLRFFvTFRtdHpTME5NYzAxVmMwa2dBUSUzRCUzRDAB');

casper.wait(5000, function () {
  this.each(this.evaluate(function () {
    var res = [],
        nodes = document.querySelectorAll('yt-live-chat-text-message-renderer'),
        author = null,
        message = null;

    for (var i = 0; i < nodes.length; i++) {
      author = nodes[i].querySelector('#author-name').textContent.toUpperCase();
      message = nodes[i].querySelector('#message').textContent.toLowerCase();
      res.push(author + ' - ' + message);
    }

    return res;
  }), function (self, post) {
    this.echo(post);
  });
});

casper.run();

With this script, you should see the latest messages from the conversation in your terminal. :)
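The extraction step can also be written as a small standalone function, which makes it easier to check outside the browser. This is only a sketch: the name `extractPosts` is hypothetical, and it assumes the same element name and ids as the script above. Unlike that script, it skips nodes that lack an author or message element instead of throwing, and it leaves the text case unchanged.

```javascript
// Collect {author, message} pairs from any DOM-like root
// (anything exposing querySelectorAll, e.g. document).
function extractPosts(root) {
  var nodes = root.querySelectorAll('yt-live-chat-text-message-renderer');
  var posts = [];
  for (var i = 0; i < nodes.length; i++) {
    var author = nodes[i].querySelector('#author-name');
    var message = nodes[i].querySelector('#message');
    // Skip renderers that are missing either child element
    if (author && message) {
      posts.push({
        author: author.textContent.trim(),
        message: message.textContent.trim()
      });
    }
  }
  return posts;
}
```

Functions passed to casper.evaluate() run in the page context, so a helper like this has to be defined inside the evaluated function (or injected into the page) rather than referenced from the outer script.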


EDIT 2

Since the video is back, I spent some time modifying my previous code to implement real-time polling with a recursive IIFE. The following script prints the latest comment in the chat stream. Every second it re-reads the chat from the page and prints the latest post only if its message differs from the previously printed one, which avoids duplicates.

var casper = require("casper").create({
  viewportSize: {
    width: 1080,
    height: 724
  }
});

casper.userAgent('Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0');

casper.start('https://www.youtube.com/live_chat?continuation=0ofMyAMkGiBDZzhLRFFvTFRtdHpTME5NYzAxVmMwa2dBUSUzRCUzRDAB');

var currentMessage = '';

(function getPosts() {
  var post = null;

  casper.wait(1000, function () {
    post = this.evaluate(function () {
      var nodes = document.querySelectorAll('yt-live-chat-text-message-renderer'),
          last = nodes[nodes.length - 1];

      // No message rendered yet: return null instead of throwing
      if (!last) {
        return null;
      }

      return {
        author: last.querySelector('#author-name').textContent,
        message: last.querySelector('#message').textContent
      };
    });
  });

  casper.then(function () {
    // Print only when a post exists and differs from the last one shown
    if (post && currentMessage !== post.message) {
      currentMessage = post.message;
      this.echo(post.author + ' - ' + post.message);
    }
  });

  casper.then(function () {
    getPosts();
  });
})();

casper.run();

It is working PERFECTLY on my computer.
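One limitation of comparing only the latest message is that several posts arriving within the same second are reduced to one. A possible variation, sketched here with hypothetical names (`postKey`, `newPostsSince`), keeps the key of the last printed post and returns everything that follows it in the current snapshot. It assumes each poll yields an array of {author, message} objects in page order. If the same author repeats the same text, the key is ambiguous and some posts may be skipped.

```javascript
// Build a comparison key from a post; the NUL separator avoids
// accidental collisions between author and message text.
function postKey(post) {
  return post.author + '\u0000' + post.message;
}

// Return the posts in `current` that come after the post identified by
// `lastKey`. With no previous key, or if that post is no longer present
// in the snapshot, every post in the snapshot is treated as new.
function newPostsSince(lastKey, current) {
  if (lastKey === null) {
    return current.slice();
  }
  for (var i = current.length - 1; i >= 0; i--) {
    if (postKey(current[i]) === lastKey) {
      return current.slice(i + 1);
    }
  }
  return current.slice();
}
```

In the polling loop, the step that echoes a post would instead echo every item returned by newPostsSince() and then store the key of the last one.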

Badacadabra answered Nov 14 '22 00:11