
Why can't REXML parse CDATA preceded by a line break?

I'm very new to Ruby and am trying to use REXML to parse an XML document that was previously pretty-printed (also by REXML), with some slightly erratic results.

Some CDATA sections have a line break after the opening XML tag but before the opening of the CDATA block. In these cases REXML parses the text of the tag as empty.

  • Any idea if I can get REXML to read these lines?
  • If not, could I rewrite them beforehand with a regex or something?
  • Is this even valid XML?

Here's an example XML document (much abridged):

<?xml version="1.0" encoding="utf-8"?>
<root-tag>
    <content type="base64"><![CDATA[V2VsbCBkb25lISBJdCB3b3JrcyA6KQ==]]></content>
    <content type="base64">
        <![CDATA[VGhpcyB3b250IHdvcms=]]></content>

    <content><![CDATA[This will work]]></content>
    <content>
        <![CDATA[This will not appear]]></content>

    <content>
        Seems happy</content>
    <content>Obviously no problem</content>
</root-tag>

and here's my Ruby script (distilled down to a minimal example):

require 'rexml/document'
require 'base64'
include REXML

module RexmlSpike
  file = File.new("ex.xml")
  doc = Document.new file
  doc.elements.each("root-tag/content") do |contentElement|
    if contentElement.attributes["type"] == "base64"
      puts "decoded: " << Base64.decode64(contentElement.text)
    else
      puts "raw: " << contentElement.text
    end
  end
  puts "Finished."
end

The output I get is:

>> ruby spike.rb
  decoded: Well done! It works :)
  decoded:
  raw: This will work
  raw:

  raw:
          Seems happy
  raw: Obviously no problem
  Finished.

I'm using Ruby 1.9.3p392 on OS X Lion. The ultimate goal of the exercise is to parse comments from some BlogML into the custom import XML used by Disqus.

Andrew M asked Oct 12 '25 08:10


2 Answers

Why

Element#text returns only the element's first text node. Anything that comes before the <![CDATA[]]>, whether a letter, a newline (as you've discovered), or a single space, becomes a text node of its own, so text returns that instead of the contents of the <![CDATA[]]>. This makes sense, because your example asks for the text of the element, and whitespace counts as text. In the examples where you can read the <![CDATA[]]>, nothing precedes it, so it is the first text node (in REXML, CData is a subclass of Text).
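To check this yourself, here is a minimal sketch that lists the child nodes REXML builds for the failing case. The one-element document is an inline stand-in for the question's ex.xml, and the variable names are only illustrative.

```ruby
require 'rexml/document'

# Inline stand-in for one failing <content> element from the question:
# a line break and indentation sit between the opening tag and the CDATA.
doc = REXML::Document.new("<content>\n    <![CDATA[This will not appear]]></content>")
element = doc.root

# The whitespace becomes its own Text node, ahead of the CData node.
child_classes = element.children.map(&:class)

# Element#text only looks at the first text node, which here is the whitespace.
first_text = element.text

puts child_classes.inspect
puts first_text.inspect
```

The whitespace node comes first in the list of children, which is why text appears empty when printed.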


Solution

If you look at the documentation for Element, you'll see that it has a method called cdatas(), described as:

Get an array of all CData children. IMMUTABLE.

So, in your example, an inner loop over contentElement.cdatas() would show the content of all the tags that currently come back empty.
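As a rough sketch of that inner loop applied to the question's script: the XML is inlined (and trimmed) so the example runs on its own, and content_of is a hypothetical helper that is not part of REXML. It prefers CDATA children and falls back to the stripped plain text.

```ruby
require 'rexml/document'
require 'base64'

# Trimmed, inline copy of the question's document (assumption: same shape as ex.xml).
xml = <<~XML
  <?xml version="1.0" encoding="utf-8"?>
  <root-tag>
    <content type="base64"><![CDATA[V2VsbCBkb25lISBJdCB3b3JrcyA6KQ==]]></content>
    <content type="base64">
      <![CDATA[VGhpcyB3b250IHdvcms=]]></content>
    <content>
      <![CDATA[This will not appear]]></content>
    <content>
      Seems happy</content>
  </root-tag>
XML

# Hypothetical helper: join all CDATA children if there are any,
# otherwise fall back to the element's first text node, stripped.
def content_of(element)
  cdatas = element.cdatas
  return cdatas.map(&:value).join unless cdatas.empty?

  element.text.to_s.strip
end

doc = REXML::Document.new(xml)
results = []
doc.elements.each("root-tag/content") do |content_element|
  value = content_of(content_element)
  value = Base64.decode64(value) if content_element.attributes["type"] == "base64"
  results << value
end

puts results
```

Joining the CDATA values is a design choice for elements that contain more than one CDATA section. If you need them separately, iterate over cdatas directly instead.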

lightswitch05 answered Oct 15 '25 12:10


I'd recommend using Nokogiri, which is the de facto standard XML/HTML parser for Ruby. Using it to access the contents of the <content> tags, I get:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="utf-8"?>
<root-tag>
    <content type="base64"><![CDATA[V2VsbCBkb25lISBJdCB3b3JrcyA6KQ==]]></content>
    <content type="base64">
        <![CDATA[VGhpcyB3b250IHdvcms=]]></content>

    <content><![CDATA[This will work]]></content>
    <content>
        <![CDATA[This will not appear]]></content>

    <content>
        Seems happy</content>
    <content>Obviously no problem</content>
</root-tag>
EOT

doc.search('content').each do |n|
  puts n.content
end

Which outputs:

V2VsbCBkb25lISBJdCB3b3JrcyA6KQ==

        VGhpcyB3b250IHdvcms=
This will work

        This will not appear

        Seems happy
Obviously no problem

the Tin Man answered Oct 15 '25 14:10