I am parsing an HTML file with Nokogiri, modifying it, and then writing it to a file like this:
htext = File.read(inputOpts.html_file)
h_doc = Nokogiri::HTML(htext)
File.open(outputfile, 'w+') do |file|
  file.write(h_doc)
end
The output file contains the first line as:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
I do not want this because I am embedding the HTML in a different file and this declaration is causing issues.
How do I remove it from h_doc?
Depending on what you are trying to do, you could parse your HTML as a DocumentFragment:
h_doc = Nokogiri::HTML::DocumentFragment.parse(htext)
When calling to_s or to_html on a fragment, the doctype line will be omitted, as will the <html> and <body> tags that Nokogiri adds if they aren't already present.
It depends on your needs. If all you need is the body, then:
h_doc.at_xpath("//body") # returns the <body> element, including the <body></body> tags themselves
If you need the <head> too and just want to avoid the <!DOCTYPE> declaration, then:
# returns the <head> and <body> elements, without the doctype or the <html> wrapper
h_doc.xpath("//head") + h_doc.xpath("//body")
So something like this:
h_doc = Nokogiri::HTML(File.read(input_opts.html_file))
File.open(outputfile, 'w+') do |file|
  # Use one of the two writes below, not both.
  # For just <body>:
  file << h_doc.at_xpath("//body").to_s
  # For <head> and <body>:
  file << (h_doc.xpath("//head") + h_doc.xpath("//body")).to_s
end
Notice that for the body I used #at_xpath, as this returns a single Nokogiri::XML::Element, but when combining them I used #xpath, because this returns a Nokogiri::XML::NodeSet. No need to worry: this distinction only matters for the combination, and the HTML will come out the same, e.g. h_doc.at_xpath("//head").to_s == h_doc.xpath("//head").to_s #=> true