Parsing HTML page content in a stream with hyper and html5ever

Question

I'm trying to parse the HTML response of an HTTP request. I'm using hyper for the requests and html5ever for the parsing. The HTML will be pretty large and I don't need to fully parse it -- I just need to identify some data from tags so I would prefer to stream it. Conceptually I want to do something like:

# bash
curl url | read_dom

/* javascript */
http.get(url).pipe(parser);
parser.on("tag", /* check tag name, attributes, and act */)

What I have come up with so far is:

extern crate hyper;
extern crate html5ever;

use std::default::Default;
use hyper::Client;
use hyper::header::ContentType;
use html5ever::parse_document;
use html5ever::rcdom::RcDom;

fn main() {
    let client = Client::new();

    let res = client.post(WEBPAGE)
        .header(ContentType::form_url_encoded())
        .body(BODY)
        .send()
        .unwrap();

    res.read_to_end(parse_document(RcDom::default(),
      Default::default().from_utf8().unwrap()));
}

It seems like read_to_end is the method I want to call on the response to read the bytes, but it is unclear to me how to pipe this to the HTML document reader ... if this is even possible.

The documentation for parse_document says to use from_utf8 or from_bytes if the input is in bytes (which it is).

It seems that I need to create a sink from the response, but this is where I am stuck. It's also unclear to me how I can create events to listen for tag starting which is what I am interested in.

I've looked at this example of html5ever which seems to do what I want and walks the DOM, but I can't get this example itself to run -- either it's outdated or tendril/html5ever is too new. This also seems to parse the HTML as a whole rather than as a stream, but I'm not sure.

Is it possible to do what I want to do with the current implementation of these libraries?

Simon Sapin · Accepted Answer

Sorry for the lack of tutorial-like documentation for html5ever and tendril…

Unless you’re 100% sure your content is in UTF-8, use from_bytes rather than from_utf8. They return something that implements TendrilSink which allows you to provide the input incrementally (or not).
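To see why the encoding choice matters, here is a minimal std-only sketch. It does not use html5ever, and decode_assuming_utf8 is a hypothetical helper. It assumes that a UTF-8-only decoder behaves lossily, the way String::from_utf8_lossy does: bytes from another encoding are replaced with U+FFFD rather than decoded.

```rust
/// Hypothetical helper: decode bytes as if they were UTF-8, replacing
/// anything invalid with U+FFFD (the replacement character).
fn decode_assuming_utf8(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

/// "café" encoded as ISO-8859-1: the final byte 0xE9 is not valid UTF-8.
const LATIN1_CAFE: &[u8] = b"caf\xE9";
```

Decoding LATIN1_CAFE this way loses the é, which is the kind of damage that detecting the encoding from the bytes is meant to avoid.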

The std::io::Read::read_to_end method takes a &mut Vec<u8>, so it doesn’t work with TendrilSink.
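For reference, this is roughly how read_to_end is meant to be used. It is a small std-only sketch, and read_whole_body is a hypothetical name. Note that it buffers the entire body in memory before anything else can look at it, which is the opposite of streaming.

```rust
use std::io::{self, Read};

/// Hypothetical helper: read everything from `body` into one Vec<u8>.
fn read_whole_body<R: Read>(mut body: R) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // Appends every byte until EOF; nothing is handed on until it returns.
    body.read_to_end(&mut buf)?;
    Ok(buf)
}
```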

At the lowest level, you can call the TendrilSink::process method once per &[u8] chunk, and then call TendrilSink::finish.

To avoid doing that manually, there’s also the TendrilSink::read_from method that takes &mut R where R: std::io::Read. Since hyper::client::Response implements Read, you can use:

parse_document(RcDom::default(), Default::default()).from_bytes().read_from(&mut res)
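Conceptually, the process / finish / read_from pattern looks like the sketch below. It uses only std, and ChunkSink is a hypothetical stand-in for illustration, not the real TendrilSink trait (whose signatures differ). StartTagCounter is a toy sink that counts a '<' followed by an ASCII letter. It is not an HTML tokenizer, but it shows how state has to carry across chunk boundaries.

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for a chunk-oriented sink (not the tendril API).
trait ChunkSink {
    type Output;

    /// Accept one chunk of input.
    fn process(&mut self, chunk: &[u8]);

    /// Signal end of input and produce the result.
    fn finish(self) -> Self::Output;

    /// Drain `reader` chunk by chunk, then finish.
    fn read_from<R: Read>(mut self, reader: &mut R) -> io::Result<Self::Output>
    where
        Self: Sized,
    {
        let mut buf = [0u8; 4096];
        loop {
            match reader.read(&mut buf) {
                Ok(0) => return Ok(self.finish()), // EOF
                Ok(n) => self.process(&buf[..n]),
                Err(ref e) if e.kind() == io::ErrorKind::Interrupted => continue,
                Err(e) => return Err(e),
            }
        }
    }
}

/// Toy sink: counts `<` immediately followed by an ASCII letter.
/// It ignores comments, scripts, and everything else a real parser handles.
#[derive(Default)]
struct StartTagCounter {
    count: usize,
    after_lt: bool, // remembered across chunk boundaries
}

impl ChunkSink for StartTagCounter {
    type Output = usize;

    fn process(&mut self, chunk: &[u8]) {
        for &b in chunk {
            if self.after_lt && b.is_ascii_alphabetic() {
                self.count += 1;
            }
            self.after_lt = b == b'<';
        }
    }

    fn finish(self) -> usize {
        self.count
    }
}
```

Anything that implements std::io::Read can be passed to read_from here, for example a byte slice in a test or, by analogy, an HTTP response body.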

To go beyond your question, RcDom is very minimal and mostly exists in order to test html5ever. I recommend using Kuchiki instead. It has more features (for tree traversal, CSS Selector matching, …) including optional Hyper support.

In your Cargo.toml:

[dependencies]
kuchiki = {version = "0.3.1", features = ["hyper"]}

In your code:

let document = kuchiki::parse_html().from_http(res).unwrap();

Tags: rust, hyper, html5ever

Explosion Pills