I'm using lxml in Python to validate a number of XML documents against an XML Schema definition. A good number of these documents do not validate (and at the moment they're not expected to), but it would be useful if I could calculate how valid they are, as a percentage, for reporting purposes. I can also use xmllint or other command-line tools, should those be able to provide a useful statistic.
lxml parsers provide a way to get a list of the errors that occurred while trying to parse the document. Combine this with the parser's recover keyword argument and you get something like this:
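# Warning, untested, may not work
from lxml import etree

# recover=True: keep parsing past errors instead of raising
parser = etree.XMLParser(recover=True)
it_would_be_a_tree = etree.parse(your_xml_data, parser)
# Each problem the parser hit is recorded in its error log
total_errors = len(parser.error_log)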
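To see what is being counted, it may help to look at the individual log entries. The following is a minimal sketch, not a tested recipe for any particular data set: the helper name list_parse_errors and the sample bytes are made up for illustration, and it assumes the document is available in memory as bytes.

```python
from io import BytesIO

from lxml import etree


def list_parse_errors(xml_bytes):
    """Parse leniently and return (line, column, message) for each logged error."""
    parser = etree.XMLParser(recover=True)
    etree.parse(BytesIO(xml_bytes), parser)
    return [(entry.line, entry.column, entry.message) for entry in parser.error_log]


# Hypothetical sample: <a> is closed by a mismatched </b> tag.
sample = b"<root>\n  <a></b>\n</root>"
for line, column, message in list_parse_errors(sample):
    print(f"line {line}, column {column}: {message}")
```

Note that this log records problems found while parsing (well-formedness), so the count reflects how broken the markup is rather than how far a well-formed document deviates from a schema.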
Then you can calculate the percentage of the file that total_errors represents. A naive measure, such as errors per line or errors per character, is easy to compute. More sophisticated measures are also possible if it_would_be_a_tree is actually a tree structure (total_elements / total_errors, for example).
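As a rough illustration of such measures, the sketch below computes errors per line and errors per element for a single document. The function name parse_error_stats and the dictionary keys are assumptions chosen for this example, and it reports errors per element (the inverse of the ratio mentioned above) so that a clean document scores zero. What counts as a useful "percentage valid" is a reporting decision this code does not settle.

```python
from io import BytesIO

from lxml import etree


def parse_error_stats(xml_bytes):
    """Return naive error-density measures for one XML document."""
    parser = etree.XMLParser(recover=True)
    tree = etree.parse(BytesIO(xml_bytes), parser)

    total_errors = len(parser.error_log)
    total_lines = max(1, len(xml_bytes.splitlines()))

    # In recovery mode the root may be None if nothing could be salvaged.
    root = tree.getroot()
    total_elements = 0 if root is None else sum(1 for _ in root.iter("*"))

    return {
        "errors": total_errors,
        "errors_per_line": total_errors / total_lines,
        "errors_per_element": (
            total_errors / total_elements if total_elements else None
        ),
    }


# Hypothetical sample with one mismatched closing tag.
print(parse_error_stats(b"<root>\n  <a></b>\n</root>"))
```

Run over a batch of files, these numbers could be averaged or bucketed for a report. They are only comparable between documents of broadly similar size and structure.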