
XMLSchema: Is it possible to calculate how valid an invalid document is (e.g. as a percentage)?

I'm using lxml in Python to validate a number of XML documents against an XML Schema definition. A good number of these documents do not validate -- and at the moment they're not expected to -- but it would be useful if I could calculate how valid they are, as a percentage, for reporting purposes. I have the ability to use xmllint or other command line tools, should those be able to provide a useful statistic.

Phillip B Oldham asked Nov 12 '22 07:11



1 Answer

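lxml parsers expose the errors they encountered while parsing a document through their error_log attribute. Note that this log records parse (well-formedness) errors; schema violations are collected separately on the validator object. Combine the log with the parser's recover keyword argument, which tells the parser to keep going after an error, and you get something like this: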

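# Warning, untested, may not work
from lxml import etree

parser = etree.XMLParser(recover=True)  # keep parsing past errors
it_would_be_a_tree = etree.parse(your_xml_data, parser)  # path or file-like object
total_errors = len(parser.error_log)  # errors recorded during parsing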

Then you can relate total_errors to the size of the file to get a percentage. A naive measure, such as errors per line or errors per character, is easy to compute. More sophisticated measures are also possible if it_would_be_a_tree is actually a tree structure (total_errors / total_elements, for example).
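Since the question is about XML Schema validation rather than parsing, here is a rough sketch of the same idea applied to schema errors. It assumes the violations are read from the error_log of an etree.XMLSchema object after calling validate(), and that each log entry's line attribute is good enough for a per-line measure. The function names validity_percentage and schema_error_lines are made up for illustration, and the "percentage of lines without an error" score is only one naive choice among many.

```python
def validity_percentage(error_lines, total_lines):
    """Naive score: percentage of lines that have no reported error."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    # Several errors on one line count once; cap so the score stays in [0, 100].
    bad_lines = min(len(set(error_lines)), total_lines)
    return 100.0 * (total_lines - bad_lines) / total_lines


def schema_error_lines(xml_path, xsd_path):
    """Collect the line number of each schema violation reported by lxml."""
    # Imported here so the scoring helper above works without lxml installed.
    from lxml import etree

    schema = etree.XMLSchema(etree.parse(xsd_path))
    document = etree.parse(xml_path)
    schema.validate(document)  # returns False for invalid input, does not raise
    return [entry.line for entry in schema.error_log]
```

With a document and schema on disk (the file names here are placeholders), you would count the lines in the document and call validity_percentage(schema_error_lines("doc.xml", "schema.xsd"), total_lines). Keep in mind that the validator may not report every problem in a badly broken document, so the number is better suited to comparing documents with each other than to being read as an absolute score.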

Sean Vieira answered Dec 28 '22 02:12