Parsing an XML/XHTML document but ignoring errors in C#

Question

I'm writing some little applications that parse the source of a few web pages, extract some data, and save it into another format. Specifically, some of my banks don't provide downloads of transactions/statements but they do provide access to those statements on their websites.

I've done one fine, but another (HSBC UK) is proving a pain in the arse, since its source is not valid XHTML. For example there is whitespace before the <?xml?> tag, and there are places where == is used instead of = between an attribute name and its value (e.g. <li class=="lastItem">).

Of course, when I pass this data into my XmlDocument, it throws a wobbly (more accurately an exception).
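To make the failure concrete, here is a minimal sketch that feeds both kinds of malformed input to `XmlDocument`. The markup strings and the `TryLoad` helper are made up for illustration and are not taken from the bank's actual pages.

```csharp
using System;
using System.Xml;

class Program
{
    // Hypothetical helper: returns true if XmlDocument accepts the markup.
    static bool TryLoad(string markup)
    {
        var doc = new XmlDocument();
        try
        {
            doc.LoadXml(markup);
            return true;
        }
        catch (XmlException ex)
        {
            // XmlDocument is strict: any well-formedness error ends up here.
            Console.WriteLine("Rejected: " + ex.Message);
            return false;
        }
    }

    static void Main()
    {
        // Whitespace before the XML declaration.
        TryLoad(" <?xml version=\"1.0\"?><ul><li>x</li></ul>");

        // Doubled equals sign between attribute name and value.
        TryLoad("<ul><li class==\"lastItem\">x</li></ul>");
    }
}
```

Both calls should end up in the `catch` branch, since neither string is well-formed XML.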

My question is: is it possible to relax the requirements for XML parsing in C#? I know it's far better to fix these problems at the source, and that's absolutely my attitude too, but there's roughly zero chance HSBC would change their website, which already works in most browsers, just for little old me.

Pontus Gagge · Accepted Answer

Take a look at the HTML Agility Pack. It lets you extract elements from a web page that is not valid XHTML using XPath, as if the page were a well-formed XHTML document.
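As a rough sketch of what that looks like, the following assumes the `HtmlAgilityPack` NuGet package is referenced. The HTML string is an invented stand-in for the real page, and the XPath `//li` is just an example query. How the parser represents the malformed `class==` attribute is not something this sketch relies on.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

class Program
{
    static void Main()
    {
        // Invented sample with the two problems from the question:
        // leading whitespace before the XML declaration and a doubled '='.
        string html =
            "  <?xml version=\"1.0\"?>\n" +
            "<html><body><ul>" +
            "<li class=\"first\">Item A</li>" +
            "<li class==\"lastItem\">Item B</li>" +
            "</ul></body></html>";

        // LoadHtml parses leniently instead of throwing on malformed markup.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection items = doc.DocumentNode.SelectNodes("//li");
        if (items == null)
        {
            Console.WriteLine("No <li> elements found.");
            return;
        }

        foreach (HtmlNode li in items)
        {
            Console.WriteLine(li.InnerText.Trim());
        }
    }
}
```

For a real page you would load the downloaded source (for example with `doc.Load(path)`) and replace `//li` with an XPath that targets the transaction rows.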

And for the love of Kleene, don't try to regex an HTML page of any real complexity!

Tags: c#, xml

Ben Hymers
