Dealing with malformed XML [duplicate]

Tags: xml, perl

I'm dealing with malformed XML in Perl. It's generated by an upstream process that I can't change (which seems to be a common problem here). However, as far as I've seen, the XML is malformed in only one particular way: it has attribute values that contain unescaped less-than signs, e.g.:

<tag v="< 2">

I'm using Perl with XML::LibXML to parse, and this, of course, generates parse errors. I've tried the recover option, which lets the parse go ahead, but it simply stops at the first parse error, so I lose everything after that point.

It seems like I have two general choices:

  1. Fix the input XML before I parse it, perhaps using regular expressions.
  2. Find a more forgiving XML parser.

I'm leaning towards option 1, as I'd like to catch any other errors with the XML. What would you recommend? If #1, can someone guide me through the regex approach?

disruptiveglow asked Dec 05 '22 22:12


2 Answers

I know this isn't the answer you want - but the XML spec is quite clear and strict.

Malformed XML is fatal.

If it doesn't work in a validator, then your code should not even attempt to "fix" it, any more than you'd try to automatically 'fix' some program code.

From The Annotated XML Specification:

fatal error [Definition:] An error which a conforming XML processor must detect and report to the application. After encountering a fatal error, the processor may continue processing the data to search for further errors and may report such errors to the application. In order to support correction of errors, the processor may make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor must not continue normal processing (i.e., it must not continue to pass character data and information about the document's logical structure to the application in the normal way).

And specifically the commentary on why: "Draconian" error-handling

We want XML to empower programmers to write code that can be transmitted across the Web and execute on a large number of desktops. However, if this code must include error-handling for all sorts of sloppy end-user practices, it will of necessity balloon in size to the point where it, like Netscape Navigator, or Microsoft Internet Explorer, is tens of megabytes in size, thus defeating the purpose.

If you've ever tried to put together a parser for HTML, you'll realise why it needs to be this way - you end up writing SO MANY handlers for edge cases, bad tag nestings and implicit tag closure that your code is a mess right from the start.

And because it's my favourite post on Stack Overflow - here is an example of why: RegEx match open tags except XHTML self-contained tags

Now I appreciate this isn't always an option, and you probably wouldn't have come here if asking your upstream to 'fix your XML' was the path of least resistance. However, I would still urge you to report it as a defect in the application that produces the XML and, as much as possible, resist pressure to 'fix' it programmatically - because as you've rightly figured out, you're building yourself a world of pain when the right answer is 'fix the problem at source'.

If you are really stuck on this road then - as Sinan Ünür points out - your only option is to trap where your parser failed, and then inspect and try to repair as you go. But you won't find an XML parser that'll do it for you, because the ones that do are by definition broken.

I would suggest that first you:

  • Dig out a copy of the spec, to show to whoever's asked you to do this.
  • Point out to them that the whole reason we have standards is to promote interoperability.
  • Explain that by doing something that deliberately violates the standard, you are taking a business risk - you are creating code that may one day mysteriously break, because regular expressions or automatic fixing build in a set of assumptions that may not hold true.
  • A useful concept here is technical debt - explain you're incurring technical debt by automatic fixing, for something that's really not your problem.
  • Then ask them if they wish to accept that risk.
  • If they do think that's an acceptable risk, then just get on with it - you may find it worthwhile to ignore the fact that your source data looks like XML and treat it as plain text instead: use regular expressions to extract the pertinent data lines, etc.
  • Stick an apology in the comments to your future maintenance programmer, explaining who made the decision and why.
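
If you do end up on the regular-expression route for the one defect described in the question (a raw '<' inside an attribute value), a minimal sketch might look like the following. The helper name escape_lt_in_attrs is made up for illustration, and the sketch assumes attribute values are always double-quoted and that no comment, CDATA section or text node contains something shaped like ="...". If any of those assumptions fail, it will silently rewrite data it shouldn't.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): escape raw '<' characters
# that appear inside double-quoted attribute values.
# Assumptions: values are double-quoted, and nothing outside a tag
# (text, comments, CDATA) contains a ="..." sequence.
sub escape_lt_in_attrs {
    my ($xml) = @_;
    $xml =~ s{(=\s*")([^"]*)(")}{
        # Copy the captures first: the inner s/// below resets $1..$3.
        my ($open, $value, $close) = ($1, $2, $3);
        $value =~ s/</&lt;/g;
        "$open$value$close";
    }ge;
    return $xml;
}

print escape_lt_in_attrs(q{<tag v="< 2" w="ok"/>}), "\n";
# prints: <tag v="&lt; 2" w="ok"/>
```

A well-formed attribute value can never contain '<', so under those assumptions the substitution leaves valid documents unchanged. The repaired string should still go through a strict parser afterwards so that any other kind of breakage is reported rather than hidden.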

Also might be useful as a reference point: Which character should not be set as values in XML file

Sobrique answered Jan 02 '23 11:01


One option is to catch the exceptions, figure out where in the input they occurred, fix the input there, and retry.

The following is a quick, inefficient proof-of-concept script using XML::Twig because I still haven't figured out how to build & install libxml2 from scratch on Windows.

#!/usr/bin/env perl

use strict;
use warnings;

use XML::Twig;

my $xml = q{ <tag v="< 2"/> };

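# Re-parse until the document parses cleanly, repairing one
# unescaped '<' per failed attempt.
while ( 1 ) {
    eval {
        my $twig = XML::Twig->new(
            twig_handlers => { tag => \&tag_handler },
        );
        $twig->parse( $xml );
        1;
    } and last;    # parsed without error: done

    my $err = $@;

    # The parser's error message is expected to carry the byte offset
    # of the failure; if it doesn't, rethrow.
    my ($i) = ($err =~ /byte ([0-9]+)/)
        or die $err;

    # Only the one known defect is repaired; anything else is rethrown.
    substr($xml, $i, 1) eq '<'
        or die $err;
    $xml = substr($xml, 0, $i) . '&lt;' . substr($xml, $i + 1);
}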

sub tag_handler {
    my (undef, $elt) = @_;
    print $elt->att('v'), "\n";
}

I wrote more about this on my blog.

Sinan Ünür answered Jan 02 '23 09:01