
How to scrape JSON data streamed via websockets on a target site

I've been asked to scrape a site which receives data via websockets and then renders it to the page with JavaScript/jQuery. Is it possible to bypass the middleman (the DOM) and consume the data coming over the socket directly? Might this be possible with a headless WebKit browser such as PhantomJS? The target site is using socket.io.

I need to consume the data and trigger alerts based on keywords in the data. I'm considering the Goutte library and will be building the scraper in PHP.

codecowboy asked Nov 08 '13 19:11


1 Answer

Socket.io is not exactly the same as WebSockets: it adds its own handshake and message framing on top of the underlying transport. Since you know the site uses socket.io, I'm focusing on that. The easiest way to scrape this socket is to use the socket.io client.
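To make the difference from plain WebSockets concrete, here is a minimal sketch of how a raw frame from the 0.9 series of the socket.io protocol could be taken apart. It assumes the colon-separated "type:id:endpoint:data" framing of that protocol version; parseFrame and FRAME_TYPES are names made up for this illustration and are not part of socket.io, and the sample frame is invented.

```javascript
// Sketch only: parses socket.io 0.9-style frames of the form "type:id:endpoint:data".
// parseFrame and FRAME_TYPES are illustrative names, not part of the socket.io API.
var FRAME_TYPES = ['disconnect', 'connect', 'heartbeat', 'message', 'json',
                   'event', 'ack', 'error', 'noop'];

function parseFrame(raw) {
  // Only the first three colons are separators; the data part may contain colons.
  var match = /^(\d):([^:]*):([^:]*)(?::([\s\S]*))?$/.exec(raw);
  if (!match) return null;

  var frame = {
    type: FRAME_TYPES[Number(match[1])] || 'unknown',
    id: match[2],
    endpoint: match[3],
    data: match[4] === undefined ? null : match[4]
  };

  // 'json' and 'event' frames carry a JSON payload;
  // event payloads look like {"name": "...", "args": [...]}.
  if (frame.data !== null && (frame.type === 'json' || frame.type === 'event')) {
    try {
      frame.data = JSON.parse(frame.data);
    } catch (e) {
      return null; // malformed payload
    }
  }
  return frame;
}

// Invented sample frame: an event named "news" with one argument.
var frame = parseFrame('5:::{"name":"news","args":[{"body":"foo happened"}]}');
console.log(frame.type, frame.data.name, frame.data.args[0].body);
```

In practice the socket.io client does this decoding for you, which is why using the client is simpler than reading the raw socket.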

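Put this on your page. A socket.io server serves a matching client script at /socket.io/socket.io.js by default; a GitHub "blob" URL returns an HTML page rather than the script itself, so it cannot be used as a script source:

<script src="http://host-to-scrape:portnumber/socket.io/socket.io.js"></script>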
<script src="scraper.js"></script>

Create file scraper.js:

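// No 'g' flag: a global regex remembers lastIndex between test() calls,
// which would make some matching messages be skipped.
var keywords = /foo|bar/i;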
var socket = io.connect('http://host-to-scrape:portnumber/path');
socket.on('<socket.io-eventname>', function (data) {
  // The scraped data is in 'data', do whatever you want with it
  console.log(data);

  // Assuming data.body contains a string containing keywords:
  if(keywords.test(data.body)) callOtherFunction(data.body);

  // Talk back:
  // socket.emit('eventname', { my: 'data' });
});
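The keyword check inside the handler can also be pulled out into plain functions, so that it can be tried without a live socket. The following is a rough sketch under that assumption: matchKeywords, handleMessage and KEYWORDS are hypothetical names, the keywords are placeholders, and the message shape (an object with a string body property) is the same assumption the handler above makes.

```javascript
// Sketch only: matchKeywords, handleMessage and KEYWORDS are illustrative names,
// not part of socket.io.
var KEYWORDS = ['foo', 'bar']; // placeholder keywords

// Returns the keywords that occur in the text, ignoring case.
function matchKeywords(text, keywords) {
  var haystack = String(text).toLowerCase();
  return keywords.filter(function (keyword) {
    return haystack.indexOf(keyword.toLowerCase()) !== -1;
  });
}

// Calls onAlert(hits, data) when the message body contains at least one keyword.
function handleMessage(data, onAlert) {
  if (!data || typeof data.body !== 'string') return [];
  var hits = matchKeywords(data.body, KEYWORDS);
  if (hits.length > 0) onAlert(hits, data);
  return hits;
}

// Try it with two made-up messages instead of a live socket.
var alerts = [];
handleMessage({ body: 'Foo just shipped' }, function (hits) { alerts.push(hits); });
handleMessage({ body: 'nothing relevant' }, function (hits) { alerts.push(hits); });
console.log(alerts);
```

Inside the socket.on callback, handleMessage(data, yourAlertFunction) would then take the place of the inline test. Using plain substring matching instead of a shared regular expression also means no state is kept between messages.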

UPDATE 6-1-2014

Looking at the StackOverflow question you referenced below, it seems you're trying to run this in a browser window rather than on the server, so I removed everything about Node.js, as it is not needed.

Herman answered Sep 28 '22 19:09