
How to make Nokogiri not convert &nbsp; to a space

I fetch an HTML fragment like

"<li>市&nbsp;场&nbsp;价"

which contains "&nbsp;", but after calling to_s on the Nokogiri NodeSet, it becomes

"<li>市 场 价"

I want to keep the original HTML fragment. I tried setting the :save_with option on the to_s method, but that did not work.

Has anyone encountered the same problem who can help? Thank you in advance.

ywenbo asked Dec 18 '10 01:12


2 Answers

I encountered a similar situation, and what I came up with is a bit of a hack, but it seems to work well.

nbsp = Nokogiri::HTML("&nbsp;").text
text.gsub(nbsp, " ")

In my case, I wanted the nbsp to be a regular space. In your case, I think you want them turned back into "&nbsp;", so you could do something like:

nbsp = Nokogiri::HTML("&nbsp;").text
html.gsub(nbsp, "&nbsp;")
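The same idea can be sketched without parsing anything, assuming the character that &nbsp; decodes to is U+00A0 (decimal 160). The helper name restore_nbsp_entities and the sample string are hypothetical, and this is plain string substitution on already-serialized output, not a Nokogiri option:

```ruby
# Hypothetical helper, not part of Nokogiri: turn literal non-breaking
# spaces (U+00A0) in serialized HTML back into the &nbsp; entity.
NBSP = "\u00A0"

def restore_nbsp_entities(html)
  html.gsub(NBSP, "&nbsp;")
end

# Stand-in for what to_s returned in the question (assumed, not captured).
serialized = "<li>市#{NBSP}场#{NBSP}价</li>"
puts restore_nbsp_entities(serialized)
```

Because the replacement only targets U+00A0, regular spaces (code 32) in the markup are left alone.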
Mike Dotterer answered Oct 21 '22 03:10


I think the problem is how you're looking at the string. It will look like a space, but it's not quite the same:

require 'nokogiri'

doc = Nokogiri::HTML('"<li>市&nbsp;场&nbsp;价"')
(doc % 'li').content.chars.to_a[1].ord # => 160
(doc % 'li').to_html # => "<li>市 场 价\"</li>"

A regular space is 32, 0x20 or ' '. 160 is the decimal value for a non-breaking space, which is what &nbsp; converts to after you use Nokogiri's various inner_text, content, text or to_s methods. It's no longer an HTML entity encoding, but it's still a non-breaking space. I think Nokogiri's conversion from the entity encoding is the appropriate behavior when asking for a stringification.

There might be a flag to tell Nokogiri NOT to decode the value, but I'm not aware of one offhand. You can check on Nokogiri's mailing list, which I mentioned in the comment above, to see if there is a flag. I can also see an advantage in Nokogiri not doing the decode, so if there isn't such a flag, it would occasionally be nice to have one.

Now, all that said, I think the to_html method SHOULD return the value in its entity-encoded form, since a non-breaking space is a nasty thing to encounter in an HTML stream. I think you should mention that on the mailing list, or maybe even file it as a bug. I think it's an inappropriate result.
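To see the difference between the two characters without Nokogiri, here is a small plain-Ruby sketch. The constant names are made up for illustration, and the non-breaking space is written as the escape "\u00A0" so it is visible in the source:

```ruby
# Illustrative constants (hypothetical names).
NBSP_CHAR   = "\u00A0"  # the character &nbsp; decodes to
PLAIN_SPACE = " "

p NBSP_CHAR.ord              # => 160
p PLAIN_SPACE.ord            # => 32
p NBSP_CHAR == PLAIN_SPACE   # => false
p NBSP_CHAR.bytes            # => [194, 160] (two bytes in UTF-8)
```

So a string that looks like it contains spaces when printed can still fail an equality check against the same text typed with regular spaces.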


http://groups.google.com/group/nokogiri-talk/msg/0b81ef0dc180dc74

Okay, I can explain the behavior now. Basically, the problem boils down to encoding.

In Ruby 1.9, we examine the encoding of the string you're feeding to Nokogiri. If the input string is "utf-8", the document is assumed to be a UTF-8 document. When you output the document, since "&nbsp;" can be represented as a UTF-8 character, it is output as that UTF-8 character.

In 1.8, since we cannot detect the encoding of the document, we assume binary encoding and allow libxml2 to detect the encoding. If you set the encoding of the input document to binary, it will give you back the entities you want. Here is some code to demo:

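require 'nokogiri'

html = '<body>hello &nbsp; world</body>'

# Parse the string using whatever encoding Ruby reports for it
f    = Nokogiri.HTML(html)
node = f.css('body')
p node.inner_html

# Parse the same markup after re-encoding it as binary (ASCII-8BIT)
f    = Nokogiri.HTML(html.encode('ASCII-8BIT'))
node = f.css('body')
p node.inner_html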

I posted a youtube video too! :-)

http://www.youtube.com/watch?v=X2SzhXAt7V4

Aaron Patterson

Your sample text isn't ASCII-8BIT so try changing that encoding string to the Unicode character set name and see if inner_html will return an entity-encoded value.
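As a rough plain-Ruby illustration of why the ASCII-8BIT conversion in the demo does not carry over directly to the question's text: String#encode transcodes, so it succeeds for ASCII-only markup but raises for characters that have no binary equivalent, while force_encoding only relabels the existing bytes. The constant and method names below are hypothetical, and whether Nokogiri then emits entities for relabeled input is not verified here:

```ruby
# Illustrative strings; names are hypothetical.
ASCII_MARKUP     = "<body>hello &nbsp; world</body>"
NON_ASCII_MARKUP = "<li>市&nbsp;场&nbsp;价</li>"

# True if the string can be transcoded to binary (ASCII-8BIT) with encode.
def transcodes_to_binary?(str)
  str.encode(Encoding::BINARY)
  true
rescue Encoding::UndefinedConversionError
  false
end

# Returns a copy with the same bytes, relabeled as binary (no transcoding).
def relabel_as_binary(str)
  str.dup.force_encoding(Encoding::BINARY)
end

p transcodes_to_binary?(ASCII_MARKUP)      # => true
p transcodes_to_binary?(NON_ASCII_MARKUP)  # => false

relabeled = relabel_as_binary(NON_ASCII_MARKUP)
p relabeled.encoding == Encoding::BINARY   # => true
p relabeled.bytes == NON_ASCII_MARKUP.bytes # => true
```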

the Tin Man answered Oct 21 '22 05:10