I'm trying to parse the HTML response of an HTTP request. I'm using hyper for the requests and html5ever for the parsing. The HTML will be pretty large and I don't need to fully parse it -- I just need to identify some data from tags so I would prefer to stream it. Conceptually I want to do something like:
# bash
curl url | read_dom
/* javascript */
http.get(url).pipe(parser);
parser.on("tag", /* check tag name, attributes, and act */)
What I have come up with so far is:
extern crate hyper;
extern crate html5ever;
use std::default::Default;
use hyper::Client;
use html5ever::parse_document;
use html5ever::rcdom::{RcDom};
fn main() {
    let client = Client::new();
    let res = client.post(WEBPAGE)
        .header(ContentType::form_url_encoded())
        .body(BODY)
        .send()
        .unwrap();
    res.read_to_end(parse_document(RcDom::default(),
        Default::default().from_utf8().unwrap()));
}
It seems like read_to_end is the method I want to call on the response to read the bytes, but it is unclear to me how to pipe this to the HTML document reader ... if this is even possible.
The documentation for parse_document says to use from_utf8 or from_bytes if the input is in bytes (which it is).
It seems that I need to create a sink from the response, but this is where I am stuck. It's also unclear to me how I can listen for events such as a tag starting, which is what I am interested in.
I've looked at this example of html5ever which seems to do what I want and walks the DOM, but I can't get this example itself to run -- either it's outdated or tendril/html5ever is too new. This also seems to parse the HTML as a whole rather than as a stream, but I'm not sure.
Is it possible to do what I want to do with the current implementation of these libraries?
Sorry for the lack of tutorial-like documentation for html5ever and tendril…
Unless you’re 100% sure your content is in UTF-8, use from_bytes rather than from_utf8. They return something that implements TendrilSink, which allows you to provide the input incrementally (or not).
The std::io::Read::read_to_end method takes a &mut Vec<u8>, so it doesn’t work with TendrilSink.
At the lowest level, you can call the TendrilSink::process method once per &[u8] chunk, and then call TendrilSink::finish.
To avoid doing that manually, there’s also the TendrilSink::read_from method that takes &mut R where R: std::io::Read. Since hyper::client::Response implements Read, you can use:
parse_document(RcDom::default(), Default::default()).from_bytes().read_from(&mut res)
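To make the relationship between process, finish and read_from more concrete, here is a minimal std-only sketch of the same pattern. ChunkSink and OpenAngleCounter are hypothetical names invented for this illustration; they are not the real tendril or html5ever types, and the real TendrilSink works on tendrils rather than plain byte slices. The sketch only shows how a read_from helper can be built out of "process each chunk, then finish", assuming a 4096-byte read buffer.

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for a chunk-oriented sink (NOT the real tendril trait):
/// feed it byte chunks with `process`, then call `finish` to get the result.
trait ChunkSink: Sized {
    type Output;

    /// Consume one chunk of input; may be called any number of times.
    fn process(&mut self, chunk: &[u8]);

    /// Signal end of input and produce the final result.
    fn finish(self) -> Self::Output;

    /// Convenience wrapper: pull chunks from any reader until EOF,
    /// so the caller never has to buffer the whole body in a Vec<u8>.
    fn read_from<R: Read>(mut self, reader: &mut R) -> io::Result<Self::Output> {
        let mut buf = [0u8; 4096]; // buffer size is an arbitrary assumption
        loop {
            match reader.read(&mut buf) {
                Ok(0) => return Ok(self.finish()), // EOF
                Ok(n) => self.process(&buf[..n]),
                Err(ref e) if e.kind() == io::ErrorKind::Interrupted => continue,
                Err(e) => return Err(e),
            }
        }
    }
}

/// Toy sink that counts `<` bytes as a crude proxy for "tags seen".
/// A real HTML parser does far more; this only demonstrates the streaming shape.
struct OpenAngleCounter {
    count: usize,
}

impl ChunkSink for OpenAngleCounter {
    type Output = usize;

    fn process(&mut self, chunk: &[u8]) {
        self.count += chunk.iter().filter(|&&b| b == b'<').count();
    }

    fn finish(self) -> usize {
        self.count
    }
}
```

With this shape, anything that implements std::io::Read (a byte slice, a file, or an HTTP response body) can be passed to read_from, while calling process and finish by hand remains available when the chunks arrive some other way.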
To go beyond your question, RcDom is very minimal and mostly exists in order to test html5ever. I recommend using Kuchiki instead. It has more features (for tree traversal, CSS Selector matching, …) including optional Hyper support.
In your Cargo.toml:
[dependencies]
kuchiki = {version = "0.3.1", features = ["hyper"]}
In your code:
let document = kuchiki::parse_html().from_http(res).unwrap();