
How do I parse a page with html5ever, modify the DOM, and serialize it?

I would like to parse a web page, insert anchors at certain positions and render the modified DOM out again in order to generate docsets for Dash. Is this possible?

From the examples included in html5ever, I can see how to read an HTML file and do a poor man's HTML output, but I don't understand how I can modify the RcDom object I retrieved.

I would like to see a snippet that inserts an anchor element (<a name="foo"></a>) into an RcDom.

Note: this is a question regarding Rust and html5ever specifically ... I know how to do it in other languages or simpler HTML parsers.

asked Aug 09 '16 20:08 by kesselborn


1 Answer

Here is some code that parses a document, appends the fragment "#anchor" to the href of the existing link, and prints the new document:

extern crate html5ever;

use html5ever::{ParseOpts, parse_document};
use html5ever::tree_builder::TreeBuilderOpts;
use html5ever::rcdom::RcDom;
use html5ever::rcdom::NodeEnum::Element;
use html5ever::serialize::{SerializeOpts, serialize};
use html5ever::tendril::TendrilSink;

fn main() {
    let opts = ParseOpts {
        tree_builder: TreeBuilderOpts {
            drop_doctype: true,
            ..Default::default()
        },
        ..Default::default()
    };
    let data = "<!DOCTYPE html><html><body><a href=\"foo\"></a></body></html>".to_string();
    let dom = parse_document(RcDom::default(), opts)
        .from_utf8()
        .read_from(&mut data.as_bytes())
        .unwrap();

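    // Walk down from the document root: document -> <html> -> <body>.
    let document = dom.document.borrow();
    let html = document.children[0].borrow();
    let body = html.children[1].borrow(); // Implicit head element at children[0].

    {
        // Mutably borrow the first child of <body> (the <a> element). The
        // inner scope ends this mutable borrow before serialization.
        let mut a = body.children[0].borrow_mut();
        if let Element(_, _, ref mut attributes) = a.node {
            // Append "#anchor" to the value of the first attribute (href).
            attributes[0].value.push_tendril(&From::from("#anchor"));
        }
    }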

    let mut bytes = vec![];
    serialize(&mut bytes, &dom.document, SerializeOpts::default()).unwrap();
    let result = String::from_utf8(bytes).unwrap();
    println!("{}", result);
}

This prints the following:

<html><head></head><body><a href="foo#anchor"></a></body></html>

As you can see, we can navigate through the child nodes via each node's children field, and we can change an attribute through the vector of attributes stored in an Element.
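The same navigate-and-mutate pattern extends to inserting a whole new element, which is what the question asked for. The following is only a std-only sketch of that idea: the `Node`, `Handle`, `element`, `insert_child`, and `to_html` names are hypothetical stand-ins for an `Rc<RefCell<...>>` tree, not html5ever's own types or functions, so the real RcDom calls will differ.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Toy stand-in for an RcDom-style tree. These types are hypothetical and
// are NOT html5ever's definitions; they only mirror the Rc<RefCell<_>> shape.
type Handle = Rc<RefCell<Node>>;

struct Node {
    name: String,
    attrs: Vec<(String, String)>,
    children: Vec<Handle>,
}

// Build an element node with the given tag name and attributes.
fn element(name: &str, attrs: &[(&str, &str)]) -> Handle {
    Rc::new(RefCell::new(Node {
        name: name.to_string(),
        attrs: attrs
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect(),
        children: Vec::new(),
    }))
}

// Insert `child` into `parent` at `index`, shifting later siblings right.
// The mutable borrow lasts only for this one statement.
fn insert_child(parent: &Handle, index: usize, child: Handle) {
    parent.borrow_mut().children.insert(index, child);
}

// Minimal serializer: elements only, no text nodes and no escaping.
fn to_html(node: &Handle) -> String {
    let n = node.borrow();
    let mut out = format!("<{}", n.name);
    for (k, v) in &n.attrs {
        out.push_str(&format!(" {}=\"{}\"", k, v));
    }
    out.push('>');
    for c in &n.children {
        out.push_str(&to_html(c));
    }
    out.push_str(&format!("</{}>", n.name));
    out
}

// Build <body> with one link, then insert a named anchor before it.
fn build_example() -> String {
    let body = element("body", &[]);
    insert_child(&body, 0, element("a", &[("href", "foo")]));
    insert_child(&body, 0, element("a", &[("name", "foo")]));
    to_html(&body)
}

fn main() {
    println!("{}", build_example());
}
```

In this toy model the program prints <body><a name="foo"></a><a href="foo"></a></body>. The point to carry over is that the new node is pushed into the parent's children vector through a short-lived mutable borrow, which must end before the tree is borrowed again for serialization.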

answered Nov 16 '22 09:11 by antoyo