I've been asked to scrape a site which receives data via websockets and then renders that to the page via javascript/jquery. Is it possible to bypass the middleman (the DOM) and consume/scrape the data coming over the socket? Might this be possible with a headless webkit like phantomJS? The target site is using socket.io.
I need to consume the data and trigger alerts based on keywords in the data. I'm considering the Goutte library and will be building the scraper in PHP.
Socket.io is not exactly the same as WebSockets. Since you know the site uses socket.io, I'm focusing on that. The easiest way to scrape this socket is to use the socket.io client.
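Put this on your page. Serve your own copy of socket.io.js, downloaded from https://github.com/LearnBoost/socket.io-client/blob/0.9/dist/socket.io.js, because the GitHub blob URL returns an HTML page rather than the script itself:
<script src="socket.io.js"></script>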
<script src="scraper.js"></script>
Create file scraper.js:
// No 'g' flag: a global regex remembers lastIndex between test() calls and would miss matches
var keywords = /foo|bar/i;
var socket = io.connect('http://host-to-scrape:portnumber/path');
socket.on('<socket.io-eventname>', function (data) {
    // The scraped data is in 'data', do whatever you want with it
    console.log(data);
    // Assuming data.body contains a string containing keywords:
    if (keywords.test(data.body)) callOtherFunction(data.body);
    // Talk back:
    // socket.emit('eventname', { my: 'data' });
});
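If the alert needs to say which keywords fired rather than only that one did, the check inside the handler could be sketched roughly as follows. The helper name `findKeywords` is hypothetical (it is not part of socket.io), and it still assumes `data.body` is a string, as above.

```javascript
// Hypothetical helper: returns the distinct keywords (lowercased) found in `text`.
function findKeywords(text, pattern) {
    if (typeof text !== 'string') return [];
    // Build a fresh global, case-insensitive copy of the pattern on every call,
    // so no lastIndex state is shared between messages.
    var re = new RegExp(pattern.source, 'gi');
    var matches = text.match(re) || [];
    var seen = Object.create(null); // no inherited keys such as "constructor"
    var result = [];
    for (var i = 0; i < matches.length; i++) {
        var key = matches[i].toLowerCase();
        if (!seen[key]) {
            seen[key] = true;
            result.push(key);
        }
    }
    return result;
}
```

Inside the socket handler this might be used as `var hits = findKeywords(data.body, /foo|bar/);` followed by `if (hits.length) callOtherFunction(data.body);`.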
UPDATE 6-1-2014
Looking at the StackOverflow question you referenced below, it seems you're trying to run this in a browser window rather than on the server. So I removed everything about NodeJS, as it is not needed.