Parsing HTML page content in a stream with hyper and html5ever

I'm trying to parse the HTML response of an HTTP request. I'm using hyper for the requests and html5ever for the parsing. The HTML will be pretty large and I don't need to fully parse it -- I just need to identify some data from tags so I would prefer to stream it. Conceptually I want to do something like:

# bash
curl url | read_dom

/* javascript */
http.get(url).pipe(parser);
parser.on("tag", /* check tag name, attributes, and act */)

What I have come up with so far is:

extern crate hyper;
extern crate html5ever;

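use std::default::Default;
use std::io::Read; // brings read_to_end into scope for the response
use hyper::Client;
use hyper::header::ContentType;
use html5ever::parse_document;
use html5ever::rcdom::{RcDom};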

fn main() {
    let client = Client::new();

    let res = client.post(WEBPAGE)
        .header(ContentType::form_url_encoded())
        .body(BODY)
        .send()
        .unwrap();

    res.read_to_end(parse_document(RcDom::default(),
      Default::default().from_utf8().unwrap()));
}

It seems like read_to_end is the method I want to call on the response to read the bytes, but it is unclear to me how to pipe this to the HTML document reader ... if this is even possible.

The documentation for parse_document says to use from_utf8 or from_bytes if the input is in bytes (which it is).

It seems that I need to create a sink from the response, but this is where I am stuck. It's also unclear to me how I can create events to listen for tag starting which is what I am interested in.

I've looked at this example of html5ever which seems to do what I want and walks the DOM, but I can't get this example itself to run -- either it's outdated or tendril/html5ever is too new. This also seems to parse the HTML as a whole rather than as a stream, but I'm not sure.

Is it possible to do what I want to do with the current implementation of these libraries?

Explosion Pills asked Feb 26 '16 14:02


1 Answer

Sorry for the lack of tutorial-like documentation for html5ever and tendril…

Unless you’re 100% sure your content is in UTF-8, use from_bytes rather than from_utf8. Both return a value that implements TendrilSink, which lets you provide the input incrementally (or all at once).

The std::io::Read::read_to_end method takes a &mut Vec<u8>, so it doesn’t work with TendrilSink.
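
To make the mismatch concrete, here is a minimal sketch that uses only the standard library. The helper name buffer_whole_body is made up for illustration. It shows that read_to_end can only append to a Vec<u8>, so the whole body sits in memory before anything else sees it:

```rust
use std::io::{self, Read};

/// Hypothetical helper: drains any `Read` source into memory.
/// `read_to_end` only accepts `&mut Vec<u8>` as its destination,
/// so there is no way to hand it a parser or any other sink.
fn buffer_whole_body<R: Read>(mut source: R) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // Appends every remaining byte of `source` to `buf`.
    source.read_to_end(&mut buf)?;
    Ok(buf)
}
```

This works for small pages, but it is the opposite of streaming: nothing can be parsed until the last byte has arrived.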

At the lowest level, you can call the TendrilSink::process method once per &[u8] chunk, and then call TendrilSink::finish.
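
The process-then-finish pattern can be sketched without either crate. Everything below is an assumption made for illustration: ChunkSink is a stand-in trait, not tendril's actual definition, and StartTagNames is a toy scanner rather than an HTML parser (it ignores comments, quoted ">" characters inside attributes, and so on). The point it shows is that state is kept between calls, so a tag split across two chunks is still recognized:

```rust
/// Hypothetical stand-in for a chunk-fed sink: feed chunks, then finish.
trait ChunkSink {
    type Output;
    fn process(&mut self, chunk: &[u8]);
    fn finish(self) -> Self::Output;
}

#[derive(Clone, Copy)]
enum State {
    Text,    // outside any tag
    TagOpen, // just saw '<'
    TagName, // collecting a start tag's name
    TagRest, // skipping to the closing '>'
}

/// Toy sink that records the names of start tags, in order.
struct StartTagNames {
    state: State,
    current: String,
    names: Vec<String>,
}

impl StartTagNames {
    fn new() -> Self {
        StartTagNames { state: State::Text, current: String::new(), names: Vec::new() }
    }
}

impl ChunkSink for StartTagNames {
    type Output = Vec<String>;

    fn process(&mut self, chunk: &[u8]) {
        for &b in chunk {
            match self.state {
                State::Text => {
                    if b == b'<' {
                        self.state = State::TagOpen;
                    }
                }
                State::TagOpen => {
                    if b.is_ascii_alphabetic() {
                        self.current.push(b.to_ascii_lowercase() as char);
                        self.state = State::TagName;
                    } else if b == b'>' {
                        self.state = State::Text;
                    } else {
                        // End tag, comment, or doctype: not a start tag, skip it.
                        self.state = State::TagRest;
                    }
                }
                State::TagName => {
                    if b.is_ascii_alphanumeric() {
                        self.current.push(b.to_ascii_lowercase() as char);
                    } else {
                        // The name is complete; record it and reset the buffer.
                        self.names.push(std::mem::take(&mut self.current));
                        self.state = if b == b'>' { State::Text } else { State::TagRest };
                    }
                }
                State::TagRest => {
                    if b == b'>' {
                        self.state = State::Text;
                    }
                }
            }
        }
    }

    fn finish(self) -> Vec<String> {
        self.names
    }
}
```

A caller would create the sink with StartTagNames::new(), call process once for each chunk as it arrives from the network, and call finish after the last chunk to get the collected names.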

To avoid doing that manually, there’s also the TendrilSink::read_from method that takes &mut R where R: std::io::Read. Since hyper::client::Response implements Read, you can use:

parse_document(RcDom::default(), Default::default()).from_bytes().read_from(&mut res)
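
For intuition, a read_from-style helper does roughly the kind of loop sketched below. This is a standard-library-only illustration, not tendril's source: the function feed_from_reader, its closure parameter, and the 4096-byte buffer size are all assumptions chosen for the example.

```rust
use std::io::{self, ErrorKind, Read};

/// Hypothetical helper: pulls fixed-size chunks from `reader` and hands
/// each one to `on_chunk`. Returns the total number of bytes forwarded.
fn feed_from_reader<R: Read, F: FnMut(&[u8])>(
    reader: &mut R,
    mut on_chunk: F,
) -> io::Result<usize> {
    let mut buf = [0u8; 4096]; // buffer size is an arbitrary choice
    let mut total = 0;
    loop {
        match reader.read(&mut buf) {
            // A zero-length read signals end of input.
            Ok(0) => return Ok(total),
            Ok(n) => {
                // Forward only the bytes that were actually read.
                on_chunk(&buf[..n]);
                total += n;
            }
            // Interrupted reads are safe to retry.
            Err(ref e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
}
```

Only one buffer's worth of the response is held at a time, which is what makes this approach suitable for large pages.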

To go beyond your question, RcDom is very minimal and mostly exists in order to test html5ever. I recommend using Kuchiki instead. It has more features (for tree traversal, CSS Selector matching, …) including optional Hyper support.

In your Cargo.toml:

[dependencies]
kuchiki = {version = "0.3.1", features = ["hyper"]}

In your code:

let document = kuchiki::parse_html().from_http(res).unwrap();
Simon Sapin answered Oct 08 '22 21:10