
Library for parsing XHTML files with XLINQ

When I realized I needed to create an index for approximately 50 XHTML pages, which may be added/deleted/renamed/moved in the future, I thought "No problem -- I'll write a quick index generator using LINQ to XML, since XHTML definitely counts as XML".

Of course, as soon as I tried running it, I found out that XLINQ chokes on XHTML entities like &nbsp;. I got around it by using the following algorithm:

  1. Read XHTML file into string.
  2. Use regex search and replace on that string to add a section to the DOCTYPE that defines all relevant entities. (Because I only care about the "title" attribute in the files I read, and my output file does not use any entities right now, it just sets them all to blank, but I may add the actual values later.)
  3. Parse the result into an XDocument.
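
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: the class name `XhtmlLoader`, the two-entity list, and the regex are all assumptions, and it presumes each file has a DOCTYPE with no internal subset yet.

```csharp
using System.IO;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;

public static class XhtmlLoader
{
    // Hypothetical entity list; each entity is declared as an empty string,
    // as described in the question. Add one declaration per entity the pages use.
    private const string EntityDeclarations =
        "<!ENTITY nbsp \"\"><!ENTITY copy \"\">";

    // Matches a DOCTYPE that has no internal subset, up to its closing '>'.
    private static readonly Regex DoctypeEnd =
        new Regex(@"(<!DOCTYPE[^>\[]*)>");

    public static XDocument Parse(string xhtml)
    {
        // Step 2: splice the declarations into the first DOCTYPE as an internal subset.
        string patched = DoctypeEnd.Replace(
            xhtml, m => m.Groups[1].Value + " [" + EntityDeclarations + "]>", 1);

        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse, // read the spliced-in declarations
            XmlResolver = null                   // do not fetch the external XHTML DTD
        };

        // Step 3: parse the patched text into an XDocument.
        using (var text = new StringReader(patched))
        using (var reader = XmlReader.Create(text, settings))
        {
            return XDocument.Load(reader);
        }
    }

    // Step 1: read the file into a string, then hand it to Parse.
    public static XDocument Load(string path) => Parse(File.ReadAllText(path));
}
```

With the entities declared as blank, text such as `A&nbsp;B` comes back as `AB`. Any entity not in the list still makes the parser throw. Elements in real XHTML files live in the `http://www.w3.org/1999/xhtml` namespace, so queries need an `XNamespace`.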

To save a file, I do the opposite:

  1. Save XDocument to a string.
  2. Strip out the entity definitions.
  3. Save to file.
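
The save path might look something like this. Again, this is only a sketch under assumptions: `XhtmlSaver` and its regex are hypothetical, and the pattern presumes the entity definitions contain no `]` character.

```csharp
using System.IO;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class XhtmlSaver
{
    // Matches a DOCTYPE with a bracketed internal subset; group 1 is the part to keep.
    // Assumes the subset itself contains no ']' character.
    private static readonly Regex InternalSubset =
        new Regex(@"(<!DOCTYPE[^>\[]*?)\s*\[[^\]]*\]\s*>");

    public static string ToXhtml(XDocument doc)
    {
        // Step 1: serialize; XDocument.ToString() omits the XML declaration.
        string text = doc.ToString();

        // Step 2: drop the entity definitions that were added at load time.
        return InternalSubset.Replace(text, "$1>", 1);
    }

    // Step 3: write the cleaned text to disk.
    public static void Save(XDocument doc, string path) =>
        File.WriteAllText(path, ToXhtml(doc));
}
```

An alternative that avoids string processing is to set `doc.DocumentType.InternalSubset = null` before saving. Either way, entity references that were expanded at load time are not restored in the output.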

My question is: are there any libraries (especially built-in .NET ones) I can use that will read XHTML files into XDocuments? The code I wrote has accomplished its purpose (to generate the current index and to test the rest of the generator program), and I would really prefer not to spend time testing it if someone else has already written and tested the same thing.

Thank y'all so much for your time,
Ria.

Edit: Thank you so much; this works! I still have to do a little string processing when I save the XHTML (guess the library was not really made for that:)) and I had to fiddle with the source of the Agility Pack slightly to get it to stop indiscriminately sticking a CDATA section around the insides of every style attribute (even when there was already one present), but that's the point of Open Source, right?

Ria asked Jan 28 '09 09:01



1 Answer

This might be helpful: LINQ & Lambda, Part 3: Html Agility Pack to LINQ to XML Converter

Gonzalo Quero answered Oct 17 '22 05:10
