
How to parse a DTD file in Ruby

I was trying to convert a DTD file to a YAML file, and I've tried loading it both in libXML and Nokogiri, but it seems that a DTD file is not a valid XML file. I'm fine with using any third-party gems as long as I can parse the DTD file.

My attempt at conversion:

wget "http://xml.evernote.com/pub/enml2.dtd"
irb
require 'nokogiri'
xml = Nokogiri::XML::Document.parse('enml2.dtd')
xml.to_yaml
=> "--- !ruby/object:Nokogiri::XML::Document\ndecorators: \nnode_cache: []\nerrors:\n- !ruby/exception:Nokogiri::XML::SyntaxError\n  message: |\n    Start tag expected, '<' not found\n  domain: 1\n  code: 4\n  level: 3\n  file: \n  line: 1\n  str1: \n  str2: \n  str3: \n  int1: 0\n  column: 1\n"

Any online XML validator also returns the error "Start tag expected". I assume this is because all valid XML documents start with <?xml, which DTD files seem to lack. This led me to conclude that DTD files are not valid XML files. Still, it feels weird that the XML definition syntax itself was not defined as valid XML. Why?

I'm parsing the DTD file to remove invalid attributes from an XML file, to know which attributes to keep and which to remove, so I need a way to parse the DTD file.

And ultimately, this is all just a step in trying to convert HTML to ENML (Evernote Markup Language). The steps involved in it include:

  • Converting HTML to valid XHTML
  • Converting the body to an en-note element
  • Removing invalid tags and attributes as per the DTD file
  • Validating the ENML file against the DTD
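The third step can be sketched as follows. This is only an illustration under stated assumptions: the ALLOWED table and the strip_disallowed! helper are hypothetical names, the allowlist is made up rather than taken from enml2.dtd, and REXML is used here (instead of Loofah or Nokogiri) only because it ships with Ruby.

```ruby
require 'rexml/document'

# Hypothetical allowlist (tag name => permitted attribute names).
# In practice this would be derived from the DTD, not written by hand.
ALLOWED = {
  "en-note" => %w[style],
  "a"       => %w[href title],
  "b"       => []
}.freeze

# Recursively drop attributes and child elements that are not in the allowlist.
def strip_disallowed!(element, allowed)
  permitted = allowed.fetch(element.name, [])

  # Collect the names first so nothing is deleted while iterating.
  names = []
  element.attributes.each_attribute { |attr| names << attr.expanded_name }
  names.each { |name| element.delete_attribute(name) unless permitted.include?(name) }

  element.elements.to_a.each do |child|
    if allowed.key?(child.name)
      strip_disallowed!(child, allowed)
    else
      element.delete_element(child) # removes the child and its whole subtree
    end
  end
  element
end

doc = REXML::Document.new(
  '<en-note style="x" onclick="evil()"><a href="#" target="_blank">hi</a><script>1</script></en-note>'
)
strip_disallowed!(doc.root, ALLOWED)
puts doc
```

Note that this sketch deletes a disallowed element together with its contents; keeping the text of an unknown tag while dropping the tag itself would need different handling.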

I'm currently thinking of just copying the disallowed attributes and tags from "Understanding the Evernote Markup Language" and using that list to validate my XHTML, but I'd prefer to use the DTD as my source.

The Nokogiri DTD class is a Node class for holding an inline DTD node and validating against it. In my case, I have an external DTD file specified using the SYSTEM attribute, which Nokogiri does not seem to support. And even if it did work, all I would get is validation.

I did get validation to work properly using:

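require 'xml'  # libxml-ruby; makes LibXML::XML available as XML

# Build a DTD object from the DTD file's contents
dtd = XML::Dtd.new(File.read(Rails.root.join('lib', 'assets', 'enml2.dtd')))
# enml is the ENML document as a string
enml_document = XML::Document.string(enml)
ret = enml_document.validate(dtd)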

I haven't tried REXML. I will give that a go and report back.

I'm trying to convert an HTML document to an XML document that validates against the given DTD. Most HTML elements and attributes are not allowed in the ENML schema, so I have to strip them. I also need to know which attributes are allowed and which are not, so that I can parse the XML properly and remove or sanitize the offending elements and attributes.

For the cleanup, I'm using Loofah, but to use it, I need a tag->attributes list (which attributes are allowed for each tag). Instead of making multiple validation passes over the document (I validate once, at the end of cleanup), I'm just looping through each XML tag and cleaning it up. But to know how to clean them, I need to know which tags and attributes the schema allows. Thus, I need to parse the DTD file.
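A minimal sketch of building such a tag->attributes list straight from the DTD text, in plain Ruby. It is a regex-based reader, not a real DTD parser: it assumes simple <!ELEMENT> and <!ATTLIST> declarations, expands only internal parameter entities, and ignores conditional sections and external entities. SAMPLE_DTD, ATTRIBUTE_DEF, expand_parameter_entities and allowed_attributes are hypothetical names, and the sample DTD is made up for illustration (it is not enml2.dtd).

```ruby
# Made-up DTD fragment used for illustration (not the Evernote DTD).
SAMPLE_DTD = <<~'DTD'
  <!-- shared attributes -->
  <!ENTITY % coreattrs "style CDATA #IMPLIED title CDATA #IMPLIED">
  <!ELEMENT en-note ANY>
  <!ATTLIST en-note
    %coreattrs;
    bgcolor CDATA #IMPLIED
  >
  <!ELEMENT br EMPTY>
  <!ELEMENT a ANY>
  <!ATTLIST a href CDATA #REQUIRED
              shape (rect|circle) "rect">
DTD

# One attribute definition: name, then a type or enumeration, then a default.
ATTRIBUTE_DEF = /(\S+)\s+(?:\([^)]*\)|\S+)\s+(?:#REQUIRED|#IMPLIED|(?:#FIXED\s+)?(?:"[^"]*"|'[^']*'))/

# Replace %name; references with the values of internal parameter entities.
def expand_parameter_entities(text)
  entities = {}
  text.scan(/<!ENTITY\s+%\s+(\S+)\s+(["'])(.*?)\2\s*>/m) do |name, _quote, value|
    entities[name] ||= value # the first declaration of a name wins
  end
  10.times do # bounded, because entity values may reference other entities
    expanded = text.gsub(/%([\w.-]+);/) { |ref| entities.fetch(ref[1..-2], ref) }
    break if expanded == text
    text = expanded
  end
  text
end

# Return a hash of tag name => array of declared attribute names.
def allowed_attributes(dtd_text)
  # Strip comments first so commented-out declarations are not picked up.
  text = expand_parameter_entities(dtd_text.gsub(/<!--.*?-->/m, ""))
  allowed = {}
  text.scan(/<!ELEMENT\s+(\S+)/) { |(tag)| allowed[tag] ||= [] }
  text.scan(/<!ATTLIST\s+(\S+)\s+(.*?)>/m) do |tag, body|
    body.scan(ATTRIBUTE_DEF) { |(name)| (allowed[tag] ||= []) << name }
  end
  allowed
end

allowed_attributes(SAMPLE_DTD).each do |tag, attrs|
  puts "#{tag}: #{attrs.join(', ')}"
end
```

The resulting hash could then be dumped with to_yaml or fed to a cleanup pass. For a DTD that pulls in external entity files, the unexpanded references are simply left in place by this sketch, so a proper parser would still be needed there.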

From what I understand, XSLT is the right tool for the job, but I'm not comfortable enough with it to use it.

Nemo asked Jul 12 '14 16:07


1 Answer

However, it does feel weird to me that the xml definition syntax itself was not defined as valid XML. I'd love to know any reasons behind this.

DTDs are a holdover from SGML, the precursor of XML, so it is actually not very strange that DTDs are not XML files. Keeping DTDs and their particular syntax was a deliberate decision when XML was created.

More modern schema languages such as W3C XML Schema and RELAX NG do use XML syntax.


The reason I'm parsing the DTD file is that I want to remove invalid attributes from an XML file. To know which attributes to keep and which to remove, I need a way to parse the DTD file. (from question)

I am just looking for a way to parse DTD files, not just validate using them, because I want to perform custom cleanup and validation using the dtd. (from bounty text)

I don't really understand what you mean by "custom cleanup". I also don't see the point in trying to parse the DTD in the first place.

In order to find out if any elements or attributes in an XML file are invalid (if they break the rules in an associated DTD), you need to parse the XML file using a validating XML parser. The parser will then tell you if there are any errors that need to be fixed.

Nokogiri is based on libxml2 which provides a validating parser. It does support external DTDs that are specified using <!DOCTYPE foo SYSTEM "bar.dtd"> syntax (how to make this work is shown in a comment on the issue that you refer to: https://github.com/sparklemotion/nokogiri/issues/440#issuecomment-3031164).

Here is how the validation can be done:

require 'nokogiri'

xml = File.read("yourfile.xml")
options = Nokogiri::XML::ParseOptions::DTDLOAD   # Needed for the external DTD to be loaded
doc = Nokogiri::XML::Document.parse(xml, nil, nil, options)
puts doc.external_subset.validate(doc) 

validate returns an array of validation errors. If this code prints no errors, the XML document is valid against the DTD.

mzjn answered Sep 18 '22 20:09