
Parse an HTML fragment whitelisting some custom tags

I'm trying to parse an HTML fragment that contains a custom HTML tag using Nokogiri.

Example:

string = "<div>hello</div>\n<custom-tag></custom-tag>"

I've tried loading it in several ways, but none of them works well.

If I use Nokogiri::HTML:

doc = Nokogiri::HTML(string)

When I call to_html, it adds a doctype and an html tag that wraps the content, which I don't want.

If I use Nokogiri::XML:

doc = Nokogiri::XML(string)

I get "Error at line 2: Extra content at the end of the document", since XML requires a single root tag that wraps all the document content. If I try to save this content again, the output is <div>hello</div> (every tag after the first is removed).

I also tried Nokogiri::HTML.fragment:

doc = Nokogiri::HTML.fragment(string)

But it complains about the custom-tag.

How can I make Nokogiri parse this HTML fragment correctly?

ProGM asked Mar 29 '16 08:03


1 Answer

doc = Nokogiri::HTML.fragment(string) is the way to go. You can ignore the entries in doc.errors that complain about the invalid tag.

You are giving it invalid HTML, so you can't expect it to not report errors, but HTML parsers tend to be forgiving.

You can also use Nokogiri::XML.fragment, if you're sure the rest of it is well-formed. That won't give you errors about undefined tags.

Dmitri answered Dec 07 '22 20:12