
HTML whitelist in C#

I spent about 30 minutes on SO looking for a definitive solution to this problem.

This question seems to have been asked a lot of times but...

  • Most solutions use regular expressions.
  • There are a lot of posts saying that regular expressions should not be used to process HTML.
  • There are lots of answers simply giving a link to the HTMLAgilityPack (on Codeplex) but no real examples of how to use this pack to meet the stated requirements.

So I am looking for the best solution to meet the following requirements.

  • I want to provide an allowed list of HTML tags.
  • Any tags not in the allowed list should be removed along with their attributes and contents.
  • Any tags in the allowed list should be preserved with attributes and contents.
  • The solution should cope with different localisations: users may write in languages and character sets other than those used in English.
  • [Added] The solution should handle text such as a forum post rather than a full HTML page, so tags such as b, u and i would be allowed, but script, div, etc. are not allowed and should be removed.

I am looking for a C# solution, and if it's best to use a RegEx then I am happy to do so. If there is an existing library that can do this, I am also happy to use it. I would appreciate some example code where possible.

I am looking for a definitive and tried and tested method of solving this problem as opposed to extensive debate + closed posts etc :) :)

Thanks in advance.

Remotec asked Nov 14 '22 21:11


1 Answer

You can use the Html Agility Pack to parse the HTML. You can then work with the elements however you like and write the document back out as HTML.
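
As a minimal sketch of that approach, the following assumes the HtmlAgilityPack NuGet package is referenced. The class name HtmlWhitelist, the method name Sanitize, and the allowed-tag set (b, i, u) are illustrative choices, not anything prescribed by the library. It parses the fragment, removes every element whose tag is not in the allowed list (together with its attributes and contents), and serializes what is left.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class HtmlWhitelist
{
    // Illustrative allowed list for a forum post; extend as needed.
    private static readonly HashSet<string> AllowedTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "b", "i", "u" };

    public static string Sanitize(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Take a snapshot first: removing nodes while enumerating
        // Descendants() would change the tree being walked.
        List<HtmlNode> disallowed = doc.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element
                        && !AllowedTags.Contains(n.Name))
            .ToList();

        foreach (HtmlNode node in disallowed)
        {
            // Drops the element along with its attributes and everything inside it.
            node.Remove();
        }

        // Allowed elements are written back unchanged, attributes included.
        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling HtmlWhitelist.Sanitize on a forum post such as "<b>bold</b><script>alert(1)</script>" should keep the b element and drop the script element. Because the input is already a .NET string, text in other languages and character sets passes through the parser as ordinary text nodes. Note that an allowed tag nested inside a disallowed one is removed with its parent, which matches the requirement that disallowed tags are removed along with their contents.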

Matthias answered Nov 23 '22 23:11