
Stop Nokogiri from adding DOCTYPE and meta tags?

I'm trying to use Nokogiri to convert some template files from one format to another. But it keeps adding tags. I'm trying to prevent it from adding Doctype and meta tags, but can't figure it out. I've tried

@doc = Nokogiri::HTML.parse(r)

but that adds the tags. I've also tried

@doc = Nokogiri::HTML.fragment(r)

as suggested in "How to prevent Nokogiri from adding <DOCTYPE> tags?", but that removes any <html>, <head>, or <body> tags that are in the document.

If it matters, my code for reading the file is:

f = File.read(infile)
r = f.gsub(/<tmpl_var ([^>]*)>/, '{{{\1}}}')
@doc = Nokogiri::HTML.fragment(r)

I need to do a gsub beforehand because I need to replace <tmpl_var> tags which aren't proper HTML and cause more problems.
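
To show what that substitution does on its own, here is a minimal sketch using only core Ruby. The sample template string and the variable names in it are made up for illustration; the regular expression is the one from the code above.

```ruby
# Hypothetical sample input; the real script reads this from File.read(infile).
template = '<div><tmpl_var name="title"></div><p><tmpl_var body></p>'

# Everything after "tmpl_var " up to the closing ">" is captured as group 1
# and re-emitted inside triple braces.
converted = template.gsub(/<tmpl_var ([^>]*)>/, '{{{\1}}}')

puts converted
# => <div>{{{name="title"}}}</div><p>{{{body}}}</p>
```

Note that the whole attribute text is captured, so `name="title"` ends up inside the braces unchanged.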

When using HTML.fragment(r), I do get an htmlParseStartTag: misplaced <html> tag error (as well as similar errors for <body> and <head>).

Is there a way to prevent it from making these additions?

An example conversion:

Before:

<html>
    <head>
        <script>
            var x = "y";
        </script>
    </head>
    <body>
        <div>
            Stuff
        </div>
    </body>
</html>

After using Nokogiri::HTML.parse:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <script>
            var x = "y";
        </script>
    </head>
    <body>
        <div>
            Stuff
        </div>
    </body>
</html>

After using HTML.fragment or HTML::DocumentFragment.parse:

<script>
    var x = "y";
</script>

<div>
    Stuff
</div>

In this case, I want the output to match the "Before" section exactly. (The real script makes a number of changes to the document, though.)

CSturgess asked Sep 23 '14 15:09


2 Answers

Nokogiri can be told to not add the standard HTML headers. Consider these:

require 'nokogiri'

doc = Nokogiri::HTML('<p>foo</p>')
doc.to_html # => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>foo</p></body></html>\n"

doc = Nokogiri::HTML.fragment('<p>foo</p>')
doc.to_html # => "<p>foo</p>"

tmpl_var is not a valid HTML tag name, and {{{\1}}} is not valid markup either, so asking Nokogiri to parse either one will produce errors. The examples below use a similar made-up tag, templ_var:

doc = Nokogiri::HTML.fragment('<templ_var p1="baz">foo</templ_var>')
doc.errors # => [#<Nokogiri::XML::SyntaxError: Tag templ_var invalid>]

But you can still munge the DOM:

doc.to_html # => "<templ_var p1=\"baz\">foo</templ_var>"
doc.search('templ_var').each { |t| t.name = 'bar'}
doc.to_html # => "<bar p1=\"baz\">foo</bar>"

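Or, starting from a fresh fragment, replace the node outright:

# Re-parse, because the previous example renamed the node to <bar>.
doc = Nokogiri::HTML.fragment('<div><templ_var p1="baz">foo</templ_var></div>')
doc.to_html # => "<div><templ_var p1=\"baz\">foo</templ_var></div>"
# '\1' is inserted literally here; there is no regex capture to expand.
doc.search('templ_var').each { |t| t.replace('{{{\1}}}') }
doc.to_html # => "<div>{{{\\1}}}</div>"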

Putting that stuff together, plus a bit of chicanery:

doc = Nokogiri::HTML.fragment('<div><templ_var p1="baz">foo</templ_var></div>')

doc.to_html # => "<div><templ_var p1=\"baz\">foo</templ_var></div>"

doc.search('templ_var').each { |t| t.replace('{{{\1}}}') }
doc.to_html # => "<div>{{{\\1}}}</div>"

header = Nokogiri::XML.fragment('<html><body>')
header.at('body').children = doc
header.to_html # => "<html><body><div>{{{\\1}}}</div></body></html>"

So, I'd go after it something like that.

Now, why is Nokogiri stripping the <html> tag when parsing a fragment? I don't know. It leaves <body> alone when neither <head> nor <html> is present:

Nokogiri::HTML.fragment('<p>foo<p>').to_html 
# => "<p>foo</p><p></p>"
Nokogiri::HTML.fragment('<body><p>foo<p></body>').to_html 
# => "<body>\n<p>foo</p>\n<p></p>\n</body>"

But it gets funky if <head> or <html> exists:

Nokogiri::HTML.fragment('<head><style></style></head><body><p>foo<p></body>').to_html 
# => "<style></style><p>foo</p><p></p>"
Nokogiri::HTML.fragment('<html><head><style></style></head><body><p>foo<p></body></html>').to_html 
# => "<style></style><p>foo</p><p></p>"

That smells like a bug in Nokogiri to me, as I haven't seen anything that documents that behavior.

the Tin Man answered Oct 05 '22 12:10


You can get around this by using Nokogiri::XML::DocumentFragment instead of Nokogiri::HTML::DocumentFragment. The XML version won't remove the html, head, or body tags.

CSturgess answered Oct 05 '22 12:10