When parsing HTML document, how Nokogiri handle <br>
tags? Suppose we have document that looks like this one:
<div>
Hi <br>
How are you? <br>
</div>
Do Nokogiri know that <br>
tags are something special not just regular XML tags and ignore them when parsing node feed? I think Nokogiri is that smart, but I want to make sure before I accept this project involving scraping site written as HTML4. You know what I mean (How are you?
is not a content of the first <br>
as it would be in XML).
You must parse this fragment using the HTML parser, as obviously this is not valid XML. When using the HTML one, Nokogiri then behaves as you'd expect it:
require 'nokogiri'
doc = Nokogiri::HTML(<<-EOS
<div>
Hi <br>
How are you? <br>
</div>
EOS
)
doc.xpath("//br").each{ |e| puts e }
prints
<br>
<br>
Mechanize is based on Nokogiri for doing web scraping, so it is quite appropriate for the task.
Here's how Nokogiri behaves when parsing (malformed) XML:
require 'nokogiri'
doc = Nokogiri::XML("<div>Hello<br>World</div>")
puts doc.root
#=> <div>Hello<br>World</br></div>
Here's how Nokogiri behaves when parsing HTML:
require 'nokogiri'
doc = Nokogiri::HTML("<div>Hello<br>World</div>")
puts doc.root
#=> <html><body><div>Hello<br>World</div></body></html>
p doc.at('div').text
#=> "HelloWorld"
I'm assuming that by "something special" you mean that you want Nokogiri to treat it like a newline in the source text. A <br>
is not something special, and so appropriately Nokogiri does not treat it differently than any other element.
If you want it to be treated as a newline, you can do this:
doc.css('br').each{ |br| br.replace("\n") }
p doc.at('div').text
#=> "Hello\nWorld"
Similarly, if you wanted a space instead:
doc.css('br').each{ |br| br.replace(" ") }
p doc.at('div').text
#=> "Hello World"
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With