
Parse XHTML document with undefined entity

In Python, when I have to load an XHTML document that contains an undefined entity, I create a parser and add the entity (e.g. nbsp) to its entity dict:

import xml.etree.ElementTree as ET

parser = ET.XMLParser()
parser.entity['nbsp'] = ' '  # text to substitute for &nbsp;
# opener and url are defined elsewhere in my code
tree = ET.parse(opener.open(url), parser=parser)
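
For reference, here is a self-contained sketch of the same Python approach that runs without a network connection. The sample document and the `parse_xhtml` helper are made up for illustration, and registering every name from `html.entities` is an assumption about what a typical XHTML page needs. As far as I can tell, the entity dict is only consulted when the document carries a DOCTYPE with an external identifier, which XHTML pages normally do.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Hypothetical sample: an XHTML page that uses entities its parser does not know.
XHTML = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><p>a&nbsp;b &copy; 2014</p></body></html>'
)


def parse_xhtml(text):
    """Parse XHTML text, mapping the standard HTML entity names to characters."""
    parser = ET.XMLParser()
    # Register every named HTML entity (nbsp, copy, ...) in the parser's table.
    for name, codepoint in name2codepoint.items():
        parser.entity[name] = chr(codepoint)
    parser.feed(text)
    return parser.close()


root = parse_xhtml(XHTML)
paragraph = root.find(".//{http://www.w3.org/1999/xhtml}p")
print(repr(paragraph.text))
```

The external DTD named in the DOCTYPE is not downloaded; the entries in `parser.entity` are what supply the replacement text.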

In VB.NET I tried to load the XHTML document as a LINQ to XML XDocument:

Dim x As XDocument = XDocument.Load(url)

which raised an XmlException:

Reference to undeclared entity 'nbsp'

Googling around, I couldn't find any example of how to update the entity table, or any other simple way to parse an XHTML document with an undefined entity.

How can I solve this apparently simple problem?

asked Apr 07 '14 18:04 by theta


2 Answers

Entity resolution is done by the underlying parser, which here is a standard XmlReader (or XmlTextReader).

Officially, you're supposed to declare entities in DTDs (see Oleg's answer to "Problem with XHTML entities"), or load DTDs dynamically into your documents. There are some examples here on SO, such as "How do I resolve entities when loading into an XDocument?"
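
To show what declaring the entity in a DTD looks like, here is a minimal sketch in Python (the language used in the question), since the idea is parser-independent. It builds a DOCTYPE whose internal subset declares the entities and prepends it to a document that has no DOCTYPE of its own. The `build_doctype` helper and the sample markup are hypothetical, and the sketch assumes each entity maps to a single character.

```python
import xml.etree.ElementTree as ET


def build_doctype(root_name, entities):
    """Return a DOCTYPE whose internal subset declares the given entities.

    Assumes every entity value is a single character.
    """
    declarations = "".join(
        '<!ENTITY {} "&#{};">'.format(name, ord(char))
        for name, char in entities.items()
    )
    return "<!DOCTYPE {} [{}]>".format(root_name, declarations)


# Hypothetical input: markup that uses &nbsp; but has no DOCTYPE.
body = "<html><body><p>a&nbsp;b</p></body></html>"
doctype = build_doctype("html", {"nbsp": "\u00a0"})

# With the declaration in place, the parser expands &nbsp; by itself.
root = ET.fromstring(doctype + body)
text = root.find("./body/p").text
print(doctype)
print(repr(text))
```

Because the entity is now declared, no custom parser or entity table is needed; the trade-off is that the document text has to be modified before parsing.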

You can also create a hacky class derived from XmlTextReader that returns Text nodes, based on a dictionary, whenever an entity reference is encountered, as in the following sample code:

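// Namespaces needed: System.Collections.Generic, System.Xml, System.Xml.Linq
// MyXmlFile is the path or URL of the document to load
using (XmlTextReaderWithEntities reader = new XmlTextReaderWithEntities(MyXmlFile))
{
    reader.AddEntity("nbsp", "\u00A0"); // map &nbsp; to a non-breaking space
    XDocument xdoc = XDocument.Load(reader);
}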

...

public class XmlTextReaderWithEntities : XmlTextReader
{
    private string _nextEntity;
    private Dictionary<string, string> _entities = new Dictionary<string, string>();

    // NOTE: override other constructors for completeness
    public XmlTextReaderWithEntities(string path)
        : base(path)
    {
    }

    public void AddEntity(string entity, string value)
    {
        _entities[entity] = value;
    }

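    public override bool Read()
    {
        // A resolved entity is pending: stay on the current node so it is
        // reported as a Text node before the underlying reader advances.
        if (_nextEntity != null)
            return true;

        return base.Read();
    }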

    public override XmlNodeType NodeType
    {
        get
        {
            if (_nextEntity != null)
                return XmlNodeType.Text;

            return base.NodeType;
        }
    }

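    public override string Value
    {
        get
        {
            if (_nextEntity != null)
            {
                // Hack: reading Value consumes the pending entity, so this
                // relies on the caller reading Value exactly once per node.
                string value = _nextEntity;
                _nextEntity = null;
                return value;
            }
            return base.Value;
        }
    }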

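    public override void ResolveEntity()
    {
        // Look up the entity the reader is positioned on; if it is unknown,
        // emit the reference text (e.g. "&foo;") unchanged.
        if (!_entities.TryGetValue(LocalName, out _nextEntity))
        {
            _nextEntity = "&" + LocalName + ";";
        }
        // NOTE: base.ResolveEntity() is not called here. Whether that is
        // acceptable depends on the scenario.
    }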
}

This approach works in simple scenarios, but you may need to override other members of the reader for completeness.

PS: Sorry it's in C#; you'll have to adapt it to VB.NET :)

answered Sep 22 '22 00:09 by Simon Mourier


I haven't done this, but you could create an XmlParserContext object with the required entity declarations as its internalSubset. Pass that context to the XmlTextReader constructor, then create the XDocument by loading it from the reader. MSDN already has a simple-looking example code snippet in VB that uses a pre-defined entity.

answered Sep 19 '22 00:09 by jasso