How to make Nokogiri not convert &nbsp; to a space

Question

I fetched an HTML fragment like

"<li>市&nbsp;场&nbsp;价"

which contains "&nbsp;", but after calling to_s on the Nokogiri NodeSet it becomes

"<li>市 场 价"

I want to keep the original HTML fragment. I tried setting the :save_with option on the to_s method, but that failed.

Has anyone encountered the same problem who can help? Thank you in advance.

Mike Dotterer · Accepted Answer

I encountered a similar situation, and what I came up with was a bit of a hack, but it seems to work well.

nbsp = Nokogiri::HTML("&nbsp;").text
text.gsub(nbsp, " ")

In my case, I wanted the nbsp to be a regular space. I think in your case you want them turned back into "&nbsp;", so you could do something like:

nbsp = Nokogiri::HTML("&nbsp;").text
html.gsub(nbsp, "&nbsp;")
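The same idea can be sketched without the Nokogiri round trip, assuming the character that &nbsp; decodes to is U+00A0 (the code point 160 reported elsewhere in this thread). The helper names below are made up for illustration and are not part of Nokogiri:

```ruby
# U+00A0 is the non-breaking space character that &nbsp; decodes to.
NBSP = "\u00A0"

# Hypothetical helper: turn decoded non-breaking spaces back into entities.
def restore_nbsp_entities(html)
  html.gsub(NBSP, "&nbsp;")
end

# Hypothetical helper: turn non-breaking spaces into ordinary spaces.
def nbsp_to_space(text)
  text.gsub(NBSP, " ")
end

# Stand-in for a serialized fragment that contains raw U+00A0 characters.
serialized = "<li>市#{NBSP}场#{NBSP}价</li>"

puts restore_nbsp_entities(serialized) # entities are back in the markup
puts nbsp_to_space(serialized)         # plain ASCII spaces instead
```

Note that this rewrites every U+00A0 in the string, including any that were literal characters rather than entities in the original input.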

the Tin Man · Answer

I think the problem is how you're looking at the string. It will look like a space, but it's not quite the same:

require 'nokogiri'

doc = Nokogiri::HTML('"<li>市&nbsp;场&nbsp;价"')
(doc % 'li').content.chars.to_a[1].ord # => 160
(doc % 'li').to_html # => "<li>市 场 价\"</li>"

A regular space is 32, 0x20 or ' '. 160 is the decimal value of a non-breaking space, which is what &nbsp; converts to when you use Nokogiri's various inner_text, content, text or to_s methods. It's no longer an HTML entity encoding, but it's still a non-breaking space. I think Nokogiri's conversion from the entity encoding is the appropriate behavior when asking for a stringification.
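To check which kind of space a string really holds, a small plain-Ruby sketch like the following may help. It assumes UTF-8 strings and uses the literal "\u00A0" in place of Nokogiri output:

```ruby
space = " "
nbsp  = "\u00A0"

p space.ord      # => 32
p nbsp.ord       # => 160
p nbsp.bytes     # => [194, 160] (the UTF-8 encoding of U+00A0)
p space == nbsp  # => false

# A string that looks like it contains spaces but holds non-breaking spaces.
text = "市#{nbsp}场#{nbsp}价"
p text.include?(" ")          # => false
p text.codepoints.count(160)  # => 2

# \s does not match U+00A0, but the POSIX bracket [[:space:]] does.
p nbsp.match?(/\s/)           # => false
p nbsp.match?(/[[:space:]]/)  # => true
```

This is also why methods such as strip or split(" ") appear to ignore these "spaces".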

There might be a flag to tell Nokogiri NOT to decode the value, but I'm not aware of one offhand. You can check on the Nokogiri mailing list that I mentioned in the comment above to see if there is such a flag. I can also see an advantage in Nokogiri not decoding, so if there isn't a flag, it would occasionally be nice to have one.

Now, all that said, I think the to_html method SHOULD return the value in its entity-encoded form, since a non-breaking space is a nasty thing to encounter in an HTML stream. I think you should raise that on the mailing list, or maybe even file it as a bug. I think it's an inappropriate result.

http://groups.google.com/group/nokogiri-talk/msg/0b81ef0dc180dc74

Okay, I can explain the behavior now. Basically, the problem boils down to encoding.

In Ruby 1.9, we examine the encoding of the string you're feeding to Nokogiri. If the input string is "utf-8", the document is assumed to be a UTF-8 document. When you output the document, since "&nbsp;" can be represented as a UTF-8 character, it is output as that UTF-8 character.

In 1.8, since we cannot detect the encoding of the document, we assume binary encoding and allow libxml2 to detect the encoding. If you set the encoding of the input document to binary, it will give you back the entities you want. Here is some code to demo:

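require 'nokogiri'

html = '<body>hello &nbsp; world</body>'

# Input treated as UTF-8: &nbsp; comes back as the UTF-8 character
f    = Nokogiri.HTML(html)
node = f.css('body')
p node.inner_html

# Input marked as binary (ASCII-8BIT): the entity should be given back
f    = Nokogiri.HTML(html.encode('ASCII-8BIT'))
node = f.css('body')
p node.inner_html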

I posted a youtube video too! :-)

http://www.youtube.com/watch?v=X2SzhXAt7V4

Aaron Patterson

Your sample text isn't ASCII-8BIT so try changing that encoding string to the Unicode character set name and see if inner_html will return an entity-encoded value.
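As a rough, Nokogiri-free analogy for the encoding explanation quoted above: when the output encoding can represent U+00A0, the character can be written as-is, and when it cannot (plain ASCII, for instance), something like an entity has to be written instead. The sketch below uses only Ruby's String#encode with its fallback option. It is not how Nokogiri or libxml2 serialize documents, and the helper name is hypothetical:

```ruby
# Hypothetical helper: transcode to US-ASCII, writing every character that
# ASCII cannot represent as an HTML entity.
def to_ascii_with_entities(str)
  entity_for = lambda do |char|
    # Named entity for the non-breaking space, numeric reference otherwise.
    char == "\u00A0" ? "&nbsp;" : format("&#%d;", char.ord)
  end
  str.encode("US-ASCII", fallback: entity_for)
end

utf8 = "hello \u00A0 world"

# UTF-8 can represent U+00A0, so the raw character survives.
p utf8.encode("UTF-8").include?("\u00A0")  # => true

# US-ASCII cannot, so the fallback writes the entity instead.
p to_ascii_with_entities(utf8)             # => "hello &nbsp; world"
```

A post-processing step like this could be one way to get entity-encoded output regardless of how the parser treated the input, at the cost of also escaping every other non-ASCII character.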

Tags:

ruby

html-entities

nokogiri

ywenbo
