I fetched an HTML fragment like "<li>市&nbsp;场&nbsp;价", which contains "&nbsp;", but after calling to_s on the Nokogiri NodeSet it becomes "<li>市 场 价". I want to keep the original HTML fragment. I tried passing the :save_with option to to_s, but that failed.
Has anyone encountered the same problem and can help? Thank you in advance.
I encountered a similar situation, and what I came up with was a bit of a hack, but it seems to work well.
nbsp = Nokogiri::HTML("&nbsp;").text
text.gsub(nbsp, " ")
In my case, I wanted the nbsp to be a regular space. I think in your case, you want them turned back into "&nbsp;", so you could do something like:
nbsp = Nokogiri::HTML("&nbsp;").text
html.gsub(nbsp, "&nbsp;")
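The same idea can be sketched without Nokogiri by naming the non-breaking space character (U+00A0) directly. This is only a rough sketch: restore_nbsp_entities is a hypothetical helper name, and it assumes the serialized HTML is a UTF-8 string.

```ruby
# Hypothetical helper (not part of Nokogiri): turn literal non-breaking
# spaces (U+00A0) in an already-serialized HTML string back into "&nbsp;".
# Assumes the string is UTF-8 encoded.
NBSP = "\u00A0"

def restore_nbsp_entities(html)
  html.gsub(NBSP, "&nbsp;")
end

serialized = "<li>市#{NBSP}场#{NBSP}价</li>"
puts restore_nbsp_entities(serialized)
# prints: <li>市&nbsp;场&nbsp;价</li>
```

Regular spaces (code point 32) are left alone, so only the characters that started life as "&nbsp;" (or as literal non-breaking spaces) are rewritten.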
I think the problem is how you're looking at the string. It will look like a space, but it's not quite the same:
require 'nokogiri'
doc = Nokogiri::HTML('"<li>市&nbsp;场&nbsp;价"')
(doc % 'li').content.chars.to_a[1].ord # => 160
(doc % 'li').to_html # => "<li>市 场 价\"</li>"
A regular space is 32, 0x20 or ' '. 160 is the decimal value for a non-breaking space, which is what "&nbsp;" converts to after you use Nokogiri's various inner_text, content, text or to_s methods. It's no longer an HTML entity encoding, but it's still a non-breaking space. I think Nokogiri's conversion from the entity encoding is the appropriate behavior when asking for a stringification.
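To see the difference without involving Nokogiri at all, here is a small plain-Ruby sketch comparing the two characters. The variable names are just for illustration.

```ruby
space = " "        # regular space
nbsp  = "\u00A0"   # non-breaking space, what "&nbsp;" decodes to

p space.ord        # => 32
p nbsp.ord         # => 160
p space == nbsp    # => false
p nbsp.bytes       # => [194, 160] (two bytes in UTF-8)
```

Both characters render as a blank in most terminals, which is why the output can look unchanged even though the character is different.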
There might be a flag to tell Nokogiri NOT to decode the value, but I'm not aware of one offhand. You can check on Nokogiri's mailing list, which I mentioned in the comment above, to see if there is one. I can also see an advantage in Nokogiri not decoding, so if no such flag exists, it would occasionally be nice to have.
Now, all that said, I think the to_html method SHOULD return the value in its entity-encoded form, since a non-breaking space is a nasty thing to encounter in an HTML stream. I think you should raise that on the mailing list, or maybe even file it as a bug. I think it's an inappropriate result.
http://groups.google.com/group/nokogiri-talk/msg/0b81ef0dc180dc74
Okay, I can explain the behavior now. Basically, the problem boils down to encoding.
In Ruby 1.9, we examine the encoding of the string you're feeding to Nokogiri. If the input string is encoded as UTF-8, the document is assumed to be a UTF-8 document. When you output the document, since "&nbsp;" can be represented as a UTF-8 character, it is output as that UTF-8 character.
In 1.8, since we cannot detect the encoding of the document, we assume binary encoding and allow libxml2 to detect the encoding. If you set the encoding of the input document to binary, it will give you back the entities you want. Here is some code to demo:
require 'nokogiri'
html = '<body>hello&nbsp;world</body>'
f = Nokogiri.HTML(html)
node = f.css('body')
p node.inner_html
f = Nokogiri.HTML(html.encode('ASCII-8BIT'))
node = f.css('body')
p node.inner_html
I posted a youtube video too! :-)
http://www.youtube.com/watch?v=X2SzhXAt7V4
Aaron Patterson
Your sample text isn't ASCII-8BIT, so try changing that encoding string to the Unicode character set name and see if inner_html will return an entity-encoded value.
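For what it's worth, here is a plain-Ruby sketch of why encode('ASCII-8BIT') may not carry over from the "hello world" demo to the Chinese sample: transcoding to binary only succeeds for ASCII-only strings. String#b (Ruby 2.0+) relabels the same bytes as binary instead. Whether Nokogiri then emits entities for such input is not checked here.

```ruby
ascii_only = "<body>hello&nbsp;world</body>"
non_ascii  = "<li>市&nbsp;场&nbsp;价</li>"

# Transcoding works when every character is plain ASCII.
p ascii_only.encode("ASCII-8BIT").encoding == Encoding::ASCII_8BIT  # => true

# Transcoding fails for characters with no binary/ASCII mapping.
begin
  non_ascii.encode("ASCII-8BIT")
rescue Encoding::UndefinedConversionError => e
  puts "encode failed: #{e.class}"
end

# String#b returns a copy with the same bytes, tagged as binary.
binary = non_ascii.b
p binary.encoding == Encoding::ASCII_8BIT   # => true
p binary.bytesize == non_ascii.bytesize     # => true
```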