How to scrape JSON data streamed via websockets on a target site

Question

I've been asked to scrape a site which receives data via websockets and then renders it to the page via JavaScript/jQuery. Is it possible to bypass the middleman (the DOM) and consume/scrape the data coming over the socket? Might this be possible with a headless WebKit browser like PhantomJS? The target site is using socket.io.

I need to consume the data and trigger alerts based on keywords in the data. I'm considering the Goutte library and will be building the scraper in PHP.

Herman · Accepted Answer

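Socket.io is not exactly the same as websockets: it is its own protocol layered on top of transports such as WebSocket and HTTP long polling, so a plain websocket client cannot simply connect and read the events. Since you know the site uses socket.io, I'm focusing on that. The easiest way to scrape this socket is to use the socket.io client.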

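Put this on your page. The GitHub "blob" URL (https://github.com/LearnBoost/socket.io-client/blob/0.9/dist/socket.io.js) returns an HTML page rather than the script itself, so save dist/socket.io.js from that repository and serve a local copy. The client version should match the version the target server runs:

<script src="socket.io.js"></script>
<script src="scraper.js"></script>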

Create file scraper.js:

// No "g" flag: a global regex remembers lastIndex between test() calls and would miss matches
var keywords = /foo|bar/i;
var socket = io.connect('http://host-to-scrape:portnumber/path');
socket.on('<socket.io-eventname>', function (data) {
  // The scraped data is in 'data', do whatever you want with it
  console.log(data);

  // Assuming data.body is a string that may contain the keywords
  // (callOtherFunction is a placeholder for your own alert handler):
  if (keywords.test(data.body)) callOtherFunction(data.body);

  // Talk back:
  // socket.emit('eventname', { my: 'data' });
});
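For the keyword alerts mentioned in the question, the single regex above can be swapped for a small matcher that reports which keywords were found. This is only a sketch: escapeRegExp, createKeywordMatcher and matchKeywords are names made up for illustration, and it assumes the text to check is a plain string such as data.body.

```javascript
// Escape regex metacharacters so each keyword is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Returns a function that lists which of the given keywords occur in a string.
// Each regex is case-insensitive and deliberately has no "g" flag, so
// repeated test() calls do not depend on leftover lastIndex state.
function createKeywordMatcher(keywords) {
  var patterns = keywords.map(function (keyword) {
    return { keyword: keyword, regex: new RegExp(escapeRegExp(keyword), 'i') };
  });
  return function (text) {
    if (typeof text !== 'string') return []; // ignore non-string payloads
    return patterns
      .filter(function (pattern) { return pattern.regex.test(text); })
      .map(function (pattern) { return pattern.keyword; });
  };
}

// Hypothetical usage with the same keywords as scraper.js:
var matchKeywords = createKeywordMatcher(['foo', 'bar']);
console.log(matchKeywords('Some FOO arrived'));
```

Inside the socket.on handler you could then call matchKeywords(data.body) and trigger an alert only when the returned array is not empty.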

UPDATE 6-1-2014

Judging by the StackOverflow question you referenced below, it looks like you're trying to run this in a browser window rather than on the server. So I removed everything about NodeJS, as it is not needed.
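To illustrate why socket.io is not the same as plain websockets: if you did read the raw frames yourself, you would still have to decode the socket.io envelope around the JSON. The sketch below assumes the 0.9 wire format linked above, where a frame is "type:id:endpoint:data" and an event (type 5) carries JSON with "name" and "args". Newer socket.io versions encode frames differently, and parseSocketIo09Frame is a hypothetical helper, not part of any library.

```javascript
// Parse one socket.io 0.9 frame such as:
//   5:::{"name":"news","args":[{"body":"foo"}]}
// Returns null when the text does not look like a 0.9 frame.
function parseSocketIo09Frame(frame) {
  var match = /^(\d):([^:]*):([^:]*)(?::([\s\S]*))?$/.exec(frame);
  if (!match) return null;

  var result = {
    type: Number(match[1]), // e.g. 2 = heartbeat, 5 = event
    id: match[2],
    endpoint: match[3],
    data: match[4]          // undefined when the frame has no payload
  };

  // Event frames carry JSON with the event name and its arguments.
  // JSON.parse throws on malformed payloads; callers may want to catch that.
  if (result.type === 5 && result.data) {
    var payload = JSON.parse(result.data);
    result.name = payload.name;
    result.args = payload.args;
  }
  return result;
}

var parsed = parseSocketIo09Frame('5:::{"name":"news","args":[{"body":"foo"}]}');
console.log(parsed.name, parsed.args);
```

Using the socket.io client as shown above avoids all of this, since it handles the handshake, heartbeats and decoding for you.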

Tags: php, websocket, socket.io, web-scraping

codecowboy

