I am trying to scrape data from a website. The website uses Facebook's React. As such the source code that I can parse using Jaunt is completely different to the code I see when inspecting the elements using Chrome's inspector.
I know very little about all of this, but having done some research I think this has something to do with the DOM rather than the source code. I need a way to get my hands on this DOM code, as the original source contains nothing I want, but I don't have the foggiest idea where to begin (even having read many answers on here).
Here is an example of one of the pages I want to scrape. For example, to scrape the description I'd want to grab the text inside this tag:
<span class="light-font extended-card-description list-group-item">Example description....</span>
But as you can see this element only appears when you "Inspect Element", and not when I just view the page's source.
My question to you geniuses on here is, how can I grab this DOM Code and start scraping the elements I actually want to?
Forgive me if my terminology is completely off but as I say this is a completely new area for me, and I've done the research that I can.
Thank you very much in advance!
ReactJS, like many other JavaScript libraries and frameworks, uses client-side code (JavaScript) to render the final HTML. This means that when you, Jaunt, or your browser fetch the HTML source from the server, it doesn't yet contain the final markup the user will see. The browser has to run the JavaScript program(s) contained in the page to generate the content you wish to scrape. That generated result is the DOM you see in Chrome's inspector.
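To make the difference concrete, here is a minimal sketch in plain JavaScript (it runs in Node, no browser needed). The markup, the app id, and the renderClientSide function are made up for illustration. They are not React's API, only a stand-in for what a client-side library does after the page loads.

```javascript
// What the server sends: an empty mount point plus a script tag.
// (Hypothetical markup, for illustration only.)
var serverHtml = '<div id="app"></div><script src="bundle.js"></script>';

// Stand-in for the client-side library: once the browser runs the script,
// the mount point is filled in. This is NOT React's API, just a simulation.
function renderClientSide(html, name) {
    return html.replace(
        '<div id="app"></div>',
        '<div id="app"><span>Hello ' + name + '</span></div>'
    );
}

var renderedHtml = renderClientSide(serverHtml, "John");

// A scraper that only downloads the source sees serverHtml;
// a (headless) browser that runs the script sees renderedHtml.
console.log(serverHtml.indexOf("Hello John") !== -1);   // false
console.log(renderedHtml.indexOf("Hello John") !== -1); // true
```

This is why a plain HTTP fetch never finds the span you are after: the text only exists after the script has run.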
My favorite tool for this kind of job is CasperJS.
It (or rather PhantomJS, the tool that CasperJS drives) is a headless browser: a build of WebKit (the rendering engine behind Safari, and the one Chrome's engine was forked from) stripped of all the GUI (windows, buttons, menus). What's left is a tool that you can run from a terminal or from your Java program. It won't show any window on the screen, but it will fetch the webpages you ask for, run any JavaScript they contain, and then respond to your commands, such as "click on this link", "give me that text", "capture a screenshot", and so on.
Let's start with a simple ReactJS example, the one on http://facebook.github.io/react/index.html.
We want to scrape the "Hello John" text, but if you look at the plain HTML source (Ctrl+U or Alt+Ctrl+U) you won't see it. On the other hand, if you open the console in your browser and use the following selector, you will get the text:
    > document.querySelector('#helloExample .playgroundPreview').textContent
    "Hello John"
Here is a simple CasperJS script to do the same thing:
    // Create a CasperJS instance.
    var casper = require("casper").create();
    // Load the page, then print the text of the rendered element.
    casper.start("http://facebook.github.io/react/index.html", function() {
        this.echo(this.fetchText("#helloExample .playgroundPreview"));
    });
    casper.run();
You can save it as hello.js and execute it with casperjs hello.js from a terminal, or launch the same command from Java with Runtime.getRuntime().exec(...).
Here is a better script that avoids loading images and third-party resources (such as the Facebook button, Twitter button, Google Analytics, and the like), cutting the loading time by half. It also adds a waitForSelector step, so that we don't risk trying to fetch the text before ReactJS has had a chance to create it.
    var casper = require("casper").create({
        pageSettings: {
            loadImages: false  // skip images to speed up loading
        }
    });
    // Abort every request that does not go to the site itself
    // (social buttons, analytics, and so on).
    casper.on('resource.requested', function(requestData, request) {
        if (requestData.url.indexOf("http://facebook.github.io/") !== 0) {
            request.abort();
        }
    });
    casper.start("http://facebook.github.io/react/index.html", function() {
        // Wait until ReactJS has rendered the element before reading it.
        this.waitForSelector("#helloExample .playgroundPreview", function() {
            this.echo(this.fetchText("#helloExample .playgroundPreview"));
        });
    });
    casper.run();
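The URL check in the resource.requested handler can also be written as a small standalone function, which makes it easy to try against sample URLs outside CasperJS. This is only a sketch: isThirdParty and the sample URLs are hypothetical names chosen here, not part of CasperJS or PhantomJS.

```javascript
// Returns true when `url` does not start with `allowedPrefix`,
// i.e. when the request should be aborted.
// (Hypothetical helper, not part of CasperJS.)
function isThirdParty(url, allowedPrefix) {
    return url.indexOf(allowedPrefix) !== 0;
}

var allowed = "http://facebook.github.io/";

// A page on the site itself is allowed through.
console.log(isThirdParty("http://facebook.github.io/react/index.html", allowed)); // false
// A request to another host is blocked.
console.log(isThirdParty("http://www.google-analytics.com/analytics.js", allowed)); // true
// A plain prefix check is scheme-sensitive, so the https variant is blocked too.
console.log(isThirdParty("https://facebook.github.io/react/index.html", allowed)); // true
```

Inside the handler you would then call request.abort() whenever isThirdParty(requestData.url, allowed) is true. Note the last case: if the site you scrape is served over https, the prefix has to say so, or every request will be aborted.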
How to install CasperJS
I have had some trouble scraping ReactJS and other modern JavaScript pages with older versions of PhantomJS and CasperJS, so I recommend installing PhantomJS 2.0 and the latest CasperJS from GitHub.
For PhantomJS you can just download the official 2.0 package.
For CasperJS, since its launcher is a Python script, you should be able to check out the latest commit from GitHub and link bin/casperjs into a directory on your PATH. Here's a script for Linux or Mac OS X (it uses the https clone URL, since GitHub no longer serves the unauthenticated git:// protocol):
    > git clone https://github.com/n1k0/casperjs.git
    > cd casperjs
    > ln -sf `pwd`/bin/casperjs /usr/local/bin/casperjs
You may also want to comment out the line printing Warning PhantomJS v2.0 ... from your bin/bootstrap.js file.