What is the best way to get a plain text string from an HTML string?
public string GetPlainText(string htmlString)
{
    // any .NET built in utility?
}
Thanks in advance
html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format). Its --escape-all option escapes all special characters; the output is less readable, but it avoids corner-case formatting issues.
You can use MSHTML, which is fairly forgiving of malformed markup:
// Requires a reference to the Microsoft.mshtml interop assembly (namespace mshtml)
HTMLDocument htmldoc = new HTMLDocument();
IHTMLDocument2 htmldoc2 = (IHTMLDocument2)htmldoc;
htmldoc2.write(new object[] { "<p>Plateau <i>of<i> <b>Leng</b><hr /><b erp=\"arp\">2 sugars please</b> <xxx>what? & who?" });
string txt = htmldoc2.body.outerText;
Output: Plateau of Leng 2 sugars please what? & who?
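To tie this back to the method signature in the question, here is a minimal sketch that wraps the MSHTML calls in a helper. It assumes a Windows project that references the Microsoft.mshtml interop assembly; the class name HtmlText, the null and empty-string guards, and the call to close() are my additions for illustration, not part of the original answer.

```csharp
using System;
using mshtml; // assumption: the project references the Microsoft.mshtml interop assembly

public static class HtmlText
{
    // Hypothetical helper using the signature from the question (made static here).
    public static string GetPlainText(string htmlString)
    {
        // Nothing to parse: return an empty string instead of touching COM.
        if (string.IsNullOrEmpty(htmlString))
            return string.Empty;

        HTMLDocument htmldoc = new HTMLDocument();
        IHTMLDocument2 htmldoc2 = (IHTMLDocument2)htmldoc;

        // Feed the markup to the parser, then close the document stream.
        htmldoc2.write(new object[] { htmlString });
        htmldoc2.close();

        // Guard against a missing body or a null text value.
        IHTMLElement body = htmldoc2.body;
        return body == null ? string.Empty : (body.outerText ?? string.Empty);
    }
}
```

Usage would be something like string txt = HtmlText.GetPlainText(html);. Since MSHTML is a COM component, this only works where that component is available.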