
Parsing an XML/XHTML document but ignoring errors in C#

Tags: c#, xml

I'm writing some little applications that parse the source of a few web pages, extract some data, and save it into another format. Specifically, some of my banks don't provide downloads of transactions/statements but they do provide access to those statements on their websites.

I've done one fine, but another (HSBC UK) is proving a pain in the arse, since its source is not valid XHTML. For example, there is whitespace before the <?xml?> declaration, and there are places where == is used instead of = between an attribute name and its value (e.g. <li class=="lastItem">).

Of course, when I pass this data into my XmlDocument, it throws a wobbly (more accurately an exception).

My question is: is it possible to relax the requirements for XML parsing in C#? I know it's far better to fix these problems at source - that's absolutely my attitude too - but there's roughly zero chance HSBC would change their website, which already works in most browsers, just for little old me.

Ben Hymers asked Mar 22 '26 08:03



1 Answer

Take a look at the HTML Agility Pack. It lets you extract elements from a web page that is not valid XHTML using XPath, as if it were a well-formed XHTML document.
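As a rough illustration, here is a minimal sketch of that approach. It assumes the HtmlAgilityPack NuGet package is referenced; the class name StatementScraper, the method ExtractListItems, and the sample markup are made up for this example and are not taken from the real HSBC pages. How the parser interprets the broken class== attribute is not something to rely on, so the XPath here selects by element name only.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

// Hypothetical helper class, named for illustration only.
public static class StatementScraper
{
    // Returns the trimmed text of every <li> element found in the markup.
    public static List<string> ExtractListItems(string html)
    {
        var doc = new HtmlDocument();
        // LoadHtml is designed to tolerate malformed markup rather than
        // throw the way XmlDocument.LoadXml does.
        doc.LoadHtml(html);

        var items = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//li");
        if (nodes == null)
        {
            return items;
        }

        foreach (HtmlNode node in nodes)
        {
            items.Add(node.InnerText.Trim());
        }
        return items;
    }

    public static void Main()
    {
        // Made-up sample with the two problems from the question:
        // whitespace before the XML declaration and a doubled '='.
        string html =
            "   <?xml version=\"1.0\"?>\n" +
            "<html><body><ul>" +
            "<li>First transaction</li>" +
            "<li class==\"lastItem\">Last transaction</li>" +
            "</ul></body></html>";

        foreach (string item in ExtractListItems(html))
        {
            Console.WriteLine(item);
        }
    }
}
```

If you need to know what the parser had to recover from, the document's ParseErrors property lists the problems it found.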

And for the love of Kleene, don't try to regexp an HTML page of any real complexity!

Pontus Gagge answered Mar 23 '26 21:03
