
PhantomJS not mimicking browser behavior when looking at YouTube videos

I posted this question to the PhantomJS mailing list a week ago, but have gotten no response. Hoping for better luck here...

I've been trying to use PhantomJS to scrape information from YouTube, but haven't been able to get it working.

Consider a YouTube video embedded into a web page via an iframe element. If you load the URL referenced by the iframe's src attribute directly into a browser, you get a full-page version of the video, where the video is wrapped in an embed element. The embed element is not present in the initial page content; rather, script tags on the page run JavaScript that eventually adds the embed element to the DOM. I want to access this embed element once it appears, but it never appears when I load the page in PhantomJS.

Here's the code I'm using:

var page = require("webpage").create();

// Present PhantomJS to the server as desktop Firefox.
page.settings.userAgent = "Mozilla/5.0 (X11; rv:24.0) Gecko/20130909 Firefox/24.0";

page.open("https://www.youtube.com/embed/dQw4w9WgXcQ", function (status) {
  if (status !== "success") {
    console.log("Failed to load page");
    phantom.exit();
  } else {
    // Give the page's scripts 15 seconds to run, then count the embed elements.
    setTimeout(function () {
      var size = page.evaluate(function () {
        return document.getElementsByTagName("EMBED").length;
      });
      console.log(size);
      phantom.exit();
    }, 15000);
  }
});

I only ever see "0" printed to the console, no matter how long I set the timeout. If I look for "DIV" elements I get "3", and if I look for "SCRIPT" elements I get "5", so the code seems to be sound. I just never find any "EMBED" tags, even though if I load the URL above in my browser I do find one soon after page-load.
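To rule out timing as the cause, the fixed delay can be swapped for a polling loop that re-checks the page at short intervals. The following is only a minimal sketch: pollUntil is a hypothetical helper name (it is not part of PhantomJS), and the check function here is a stand-in counter so the snippet runs on its own.

```javascript
// Hypothetical helper (not part of PhantomJS): call check() every intervalMs
// until it returns a truthy value or timeoutMs has elapsed, then call done(found).
function pollUntil(check, done, intervalMs, timeoutMs) {
  var started = Date.now();
  (function tick() {
    if (check()) {
      done(true);
    } else if (Date.now() - started >= timeoutMs) {
      done(false);
    } else {
      setTimeout(tick, intervalMs);
    }
  })();
}

// Stand-in for the real element count check: succeeds on the third call.
var calls = 0;
pollUntil(
  function () { calls += 1; return calls >= 3; },
  function (found) {
    console.log(found ? "found after " + calls + " checks" : "timed out");
  },
  10,   // poll every 10 ms
  1000  // give up after 1 second
);
// prints "found after 3 checks"
```

In the script above, the check would instead return whether page.evaluate reports at least one EMBED element, and the done callback would log the result and call phantom.exit(). If the count stays at zero under polling as well, the delay is not the issue.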

Does anyone have any idea what the problem might be? Thanks in advance for any help.

Sean asked May 10 '15 00:05


1 Answer

Patrick's answer got me on the right track, but the full story is as follows.

YouTube's JavaScript probes the browser's capabilities before deciding whether to create some kind of video element. After trawling through the minified code, I was eventually able to fool YouTube into thinking PhantomJS supported HTML5 video by wrapping document.createElement in the page's onInitialized callback.

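// onInitialized fires before any of the page's own scripts have run.
page.onInitialized = function () {
  page.evaluate(function () {
    var create = document.createElement;
    // Wrap createElement so every new video element claims it can play any type.
    document.createElement = function (tag) {
      var elem = create.call(document, tag);
      if (tag === "video") {
        elem.canPlayType = function () { return "probably"; };
      }
      return elem;
    };
  });
};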
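The wrapping technique can be tried outside PhantomJS against a stand-in object. This is just an illustrative sketch: fakeDocument and patchCreateElement are hypothetical names made up for the demo, and fakeDocument models only createElement, not a real DOM.

```javascript
// Stand-in for the real document; only createElement is modelled (demo assumption).
var fakeDocument = {
  createElement: function (tag) {
    return {
      tagName: String(tag).toUpperCase(),
      canPlayType: function () { return ""; } // "" means "cannot play"
    };
  }
};

// Hypothetical helper: wrap doc.createElement so that elements created for
// tagName get the properties in overrides copied onto them.
function patchCreateElement(doc, tagName, overrides) {
  var original = doc.createElement;
  doc.createElement = function (tag) {
    var elem = original.call(doc, tag);
    if (String(tag).toLowerCase() === tagName) {
      for (var key in overrides) {
        if (Object.prototype.hasOwnProperty.call(overrides, key)) {
          elem[key] = overrides[key];
        }
      }
    }
    return elem;
  };
}

patchCreateElement(fakeDocument, "video", {
  canPlayType: function () { return "probably"; }
});

var video = fakeDocument.createElement("video");
var div = fakeDocument.createElement("div");
console.log(video.canPlayType("video/mp4")); // prints "probably"
console.log(div.canPlayType("video/mp4"));   // prints "" (an empty line)
```

One difference from the snippet above: the sketch compares the tag name case-insensitively. In an HTML document, createElement("VIDEO") also yields a video element, and a strict comparison against "video" would let that call through unpatched. I don't know whether YouTube's code ever does that, so treat it as a precaution rather than a required fix.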

However, this was a misstep; to get the <embed> tag I was originally after, I needed to make YouTube's code think PhantomJS supports Flash, not HTML5 video. That's also doable:

page.onInitialized = function () {
  page.evaluate(function () {
    // Replace navigator with a minimal object that advertises a Flash plugin
    // and the matching MIME type.
    window.navigator = {
      plugins: { "Shockwave Flash": { description: "Shockwave Flash 11.2 e202" } },
      mimeTypes: { "application/x-shockwave-flash": { enabledPlugin: true } }
    };
  });
};
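To show why those two properties are enough, here is a rough sketch of the kind of plugin probe a page might run. detectFlash is a hypothetical function written for this illustration; it is not YouTube's actual code, which is minified and may check other things as well.

```javascript
// Hypothetical capability probe, NOT YouTube's real detection code.
// Reports Flash as available when both the plugin entry and an enabled
// MIME type entry are present on the navigator-like object.
function detectFlash(nav) {
  var plugin = nav && nav.plugins && nav.plugins["Shockwave Flash"];
  var mime = nav && nav.mimeTypes && nav.mimeTypes["application/x-shockwave-flash"];
  return Boolean(plugin && plugin.description && mime && mime.enabledPlugin);
}

// A navigator with no plugins, standing in for an unmodified headless browser.
var bareNavigator = { plugins: {}, mimeTypes: {} };

// The same shape as the object assigned to window.navigator above.
var spoofedNavigator = {
  plugins: { "Shockwave Flash": { description: "Shockwave Flash 11.2 e202" } },
  mimeTypes: { "application/x-shockwave-flash": { enabledPlugin: true } }
};

console.log(detectFlash(bareNavigator));    // prints false
console.log(detectFlash(spoofedNavigator)); // prints true
```

Note that the replacement object defines only plugins and mimeTypes, so any page code that reads other navigator properties, such as userAgent, would see undefined.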

So that's how it's done.

Sean answered Nov 05 '22 20:11