I am using HtmlAgilityPack. Is there a one-liner that returns all the inner text of an HTML document, i.e., with all HTML tags and scripts removed?
Like this:
document.DocumentNode.InnerText
Note that this also returns the text content of <script> and <style> tags.
To avoid that, remove all of the <script> and <style> tags first, like this:
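// Requires: using System.Linq; (for ToArray)
// ToArray() takes a snapshot, so nodes can be removed while looping.
foreach (var script in document.DocumentNode.Descendants("script").ToArray())
    script.Remove();
foreach (var style in document.DocumentNode.Descendants("style").ToArray())
    style.Remove();
string text = document.DocumentNode.InnerText;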
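For reference, the two steps above can be wrapped into one helper. This is only a minimal sketch: the class name HtmlTextExtractor and the method name GetVisibleText are made up for illustration and are not part of HtmlAgilityPack. It assumes you also want entities such as &amp; decoded, which InnerText does not do on its own.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class HtmlTextExtractor
{
    // Hypothetical helper (not an HtmlAgilityPack API): returns the text of
    // an HTML string with <script> and <style> content left out.
    public static string GetVisibleText(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // ToArray() snapshots the matches so nodes can be removed while looping.
        var unwanted = document.DocumentNode.Descendants()
            .Where(n => n.Name == "script" || n.Name == "style")
            .ToArray();
        foreach (var node in unwanted)
        {
            node.Remove();
        }

        // InnerText keeps entities encoded; DeEntitize decodes them.
        return HtmlEntity.DeEntitize(document.DocumentNode.InnerText);
    }
}
```

Usage would then be a single call, for example: string text = HtmlTextExtractor.GetVisibleText(html);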
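I wrote a simple method that may help. It collects every node with a given tag name, removing each one from the document as it goes. You can then read nodeCollection[i].InnerText to get each node's text.
// Requires: using System.Collections.Generic; using HtmlAgilityPack;
HtmlDocument hDoc = new HtmlDocument();
// A plain List is used because HtmlNodeCollection has no parameterless constructor.
List<HtmlNode> nodeCollection = new List<HtmlNode>();

public void InitInstance(string htmlCode) {
    hDoc.LoadHtml(htmlCode);
    nodeCollection.Clear();
}

// Example call: GetAllNodesInnerTextByTagName(hDoc.DocumentNode, "script");
private void GetAllNodesInnerTextByTagName(HtmlNode node, string tagName) {
    if (null == node.ChildNodes) {
        return;
    }
    // SelectNodes takes an XPath expression; a bare tag name matches direct
    // children only, and the result is null when nothing matches.
    HtmlNodeCollection nCollection = node.SelectNodes(tagName);
    if (null != nCollection) {
        for (int i = 0; i < nCollection.Count; i++) {
            nodeCollection.Add(nCollection[i]);
            nCollection[i].Remove();
        }
    }
    // Recurse into the remaining children to find deeper matches.
    nCollection = node.ChildNodes;
    for (int i = 0; i < nCollection.Count; i++) {
        GetAllNodesInnerTextByTagName(nCollection[i], tagName);
    }
}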