How do I pretty-print HTML with Nokogiri?

I wrote a web crawler in Ruby and I'm using Nokogiri::HTML to parse the page. I need to print the page out, and while experimenting in IRB I noticed a pretty_print method. However, it takes a parameter and I can't figure out what it expects.

My crawler is caching the HTML of the webpages and writing it to files on my local machine. I would like to "pretty print" the HTML so that it looks nice and properly formatted when I do so.

Jarsen asked Dec 14 '09 03:12


2 Answers

The answer by @mislav is somewhat wrong. Nokogiri does support pretty-printing if you:

  • Parse the document as XML
  • Instruct Nokogiri to ignore whitespace-only nodes ("blanks") during parsing
  • Use to_xhtml or to_xml to specify pretty-printing parameters

In action:

    require 'nokogiri'

    html = '<section> <h1>Main Section 1</h1><p>Intro</p> <section> <h2>Subhead 1.1</h2><p>Meat</p><p>MOAR MEAT</p> </section><section> <h2>Subhead 1.2</h2><p>Meat</p> </section></section>'

    doc = Nokogiri::XML(html, &:noblanks)
    puts doc
    #=> <section>
    #=>   <h1>Main Section 1</h1>
    #=>   <p>Intro</p>
    #=>   <section>
    #=>     <h2>Subhead 1.1</h2>
    #=>     <p>Meat</p>
    #=>     <p>MOAR MEAT</p>
    #=>   </section>
    #=>   <section>
    #=>     <h2>Subhead 1.2</h2>
    #=>     <p>Meat</p>
    #=>   </section>
    #=> </section>

    puts doc.to_xhtml( indent:3, indent_text:"." )
    #=> <section>
    #=> ...<h1>Main Section 1</h1>
    #=> ...<p>Intro</p>
    #=> ...<section>
    #=> ......<h2>Subhead 1.1</h2>
    #=> ......<p>Meat</p>
    #=> ......<p>MOAR MEAT</p>
    #=> ...</section>
    #=> ...<section>
    #=> ......<h2>Subhead 1.2</h2>
    #=> ......<p>Meat</p>
    #=> ...</section>
    #=> </section>
Phrogz answered Oct 08 '22 14:10


By "pretty printing" an HTML page, I presume you mean reformatting the HTML structure with proper indentation. Nokogiri doesn't support this; the pretty_print method is there for the "pp" library, and its output is useful for debugging only.

There are several projects that understand HTML well enough to reformat it without destroying whitespace that is actually significant (the best known is HTML Tidy). A quick search also turned up a post titled "Pretty printing XHTML with Nokogiri and XSLT".

It comes down to this:

    require 'nokogiri'

    xsl = Nokogiri::XSLT(File.open("pretty_print.xsl"))
    html = Nokogiri(File.open("source.html"))
    puts xsl.apply_to(html).to_s

It requires you, of course, to download the linked XSL file to your filesystem. I've tried it very quickly on my machine and it works like a charm.

mislav answered Oct 08 '22 13:10