The site http://openbook.etoro.com/#/main/ has a live feed that is generated by JavaScript via XHR keep-alive requests; the server's answers arrive as gzip-compressed JSON strings.
I want to capture the feed into a file.
The usual way (WWW::Mech..) is probably not viable, because reverse-engineering all the JavaScript in the page and simulating the browser would be a really hard task, so I'm looking for an alternative solution.
My idea is to use man-in-the-middle tactics: the browser does its work, and I capture the communication with a Perl proxy dedicated to this task.
I'm able to catch the initial communication, but not the feed itself. The proxy is working OK, because the feed keeps running in the browser; only my filters don't capture it.
use strict;
use warnings;
use HTTP::Proxy;
use HTTP::Proxy::HeaderFilter::simple;
use HTTP::Proxy::BodyFilter::simple;
use Data::Dumper;

my $proxy = HTTP::Proxy->new(
    port                    => 3128,
    max_clients             => 100,
    max_keep_alive_requests => 100,
);

my $hfilter = HTTP::Proxy::HeaderFilter::simple->new(
    sub {
        my ( $self, $headers, $message ) = @_;
        print STDERR "headers", Dumper($headers);
    }
);

my $bfilter = HTTP::Proxy::BodyFilter::simple->new(
    filter => sub {
        my ( $self, $dataref, $message, $protocol, $buffer ) = @_;
        print STDERR "dataref", Dumper($dataref);
    }
);

$proxy->push_filter( response => $hfilter );    # header dumper
$proxy->push_filter( response => $bfilter );    # body dumper

$proxy->start;
Firefox is configured to use the above proxy for all communication.
The feed runs in the browser, so the proxy is feeding it data (when I stop the proxy, the feed stops too). Randomly (I can't figure out when), I get the following error:
[Tue Jul 10 17:13:58 2012] (42289) ERROR: Getting request failed: Client closed
Can anybody show me how to construct the correct HTTP::Proxy filters to Dumper all communication between the browser and the server, regardless of the keep-alive XHR requests?
Here's something that I think does what you're after:
#!/usr/bin/perl

use 5.010;
use strict;
use warnings;

use HTTP::Proxy;
use HTTP::Proxy::BodyFilter::complete;
use HTTP::Proxy::BodyFilter::simple;
use JSON::XS qw( decode_json );
use Data::Dumper qw( Dumper );

my $proxy = HTTP::Proxy->new(
    port                    => 3128,
    max_clients             => 100,
    max_keep_alive_requests => 100,
);

my $filter = HTTP::Proxy::BodyFilter::simple->new(
    sub {
        my ( $self, $dataref, $message, $protocol, $buffer ) = @_;
        return unless $$dataref;
        my $content_type = $message->headers->content_type or return;
        say "\nContent-type: $content_type";
        my $data = decode_json($$dataref);
        say Dumper($data);
    }
);

$proxy->push_filter(
    method   => 'GET',
    mime     => 'application/json',
    response => HTTP::Proxy::BodyFilter::complete->new,
    response => $filter,
);

$proxy->start;
I don't think you need a separate header filter, because you can access any headers you want to look at using $message->headers in the body filter.
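For example, the body filter callback receives the response being filtered as $message, so its headers are readable right there. A small sketch (the filter name $header_peek and the headers printed are just illustrative):

my $header_peek = HTTP::Proxy::BodyFilter::simple->new(
    sub {
        my ( $self, $dataref, $message, $protocol, $buffer ) = @_;
        # $message->headers returns the HTTP::Headers of the response
        my $type     = $message->headers->content_type            // '';
        my $encoding = $message->headers->header('Content-Encoding') // '';
        print STDERR "type=$type encoding=$encoding\n";
    }
);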
You'll note that I pushed two filters onto the pipeline. The first one is of type HTTP::Proxy::BodyFilter::complete, and its job is to collect up the chunks of the response and ensure that the real filter that follows always gets a complete message in $dataref. However, for each chunk that is received and buffered, the following filter will still be called and passed an empty $dataref. My filter ignores these by returning early.
I also set up the filter pipeline to ignore everything except GET requests that resulted in JSON responses - since these seem to be the most interesting.
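And since the original goal was to capture the feed into a file, the filter body can append each complete JSON message to a log instead of (or as well as) printing it. A minimal sketch, assuming a hypothetical output file named feed.log:

open my $log, '>>', 'feed.log' or die "Cannot open feed.log: $!";

my $capture = HTTP::Proxy::BodyFilter::simple->new(
    sub {
        my ( $self, $dataref, $message, $protocol, $buffer ) = @_;
        return unless $$dataref;         # skip the empty buffering calls
        print {$log} $$dataref, "\n";    # append one JSON message per line
    }
);

Pushed after HTTP::Proxy::BodyFilter::complete in the same way as $filter above, each write then contains a whole response body.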
Thanks for asking this question - it was an interesting little problem, and you seem to have done most of the hard work already.
Set the mime parameter; by default only text types are filtered.
$proxy->push_filter(response => $hfilter, mime => 'application/json');
$proxy->push_filter(response => $bfilter, mime => 'application/json');
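To combine this with the buffering approach from the answer above, the same push can also include HTTP::Proxy::BodyFilter::complete, so the body filter always sees whole messages. A sketch reusing $bfilter from the question:

use HTTP::Proxy::BodyFilter::complete;

$proxy->push_filter(
    mime     => 'application/json',
    response => HTTP::Proxy::BodyFilter::complete->new,    # reassemble chunks first
    response => $bfilter,                                   # then dump the full body
);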